Memory consistency models, or memory models, allow both programmers and programming language
implementers to reason about concurrent accesses to one or more memory locations. Memory model
specifications balance the often conflicting needs for precise semantics, implementation
flexibility, and ease of understanding. Towards that end, popular programming languages like
Java, C, and C++ have adopted memory models built on the conceptual foundation of Sequential
Consistency for Data-Race-Free programs (SC for DRF). These SC for DRF languages were created
with general-purpose homogeneous CPU systems in mind, and all assume a single, global memory
address space. Such a uniform address space is usually power and performance prohibitive in
heterogeneous SoCs, and for that reason most heterogeneous languages have adopted split address
spaces and operations with non-global visibility.

There have recently been two attempts to bridge the disconnect between the CPU-centric assumptions
of the SC for DRF framework and the realities of heterogeneous SoC architectures. Hower, et al.
proposed a class of Heterogeneous-Race-Free (HRF) memory models that provide a foundation for
understanding many of the issues in heterogeneous memory models. At the same time, the Khronos
Group developed the OpenCL 2.0 memory model that builds on the C++ memory model. The OpenCL 2.0
model includes features not addressed by HRF: primarily support for relaxed atomics and a property
referred to as scope inclusion. In this paper, we generalize HRF to allow formalization of and
reasoning about more complicated models using OpenCL 2.0 as a point of reference. With that
generalization, we (1) make the OpenCL 2.0 memory model more accessible by introducing a platform
for feature comparisons to other models, (2) consider a number of shortcomings in the current
OpenCL 2.0 model, and (3) propose changes that could be adopted by future OpenCL 2.0 revisions or
by other, related, models.

1. INTRODUCTION 

A memory (consistency) model specifies how individual memory operations can be ordered 
relative to one another, giving both system users and implementers the ability to reason 
about concurrent accesses to one or more memory locations. Memory model specifications 
exist at both low-level (e.g., ISA) and high-level (e.g., general purpose programming lan- 
guage) interfaces, and balance the often conflicting needs for precise semantics, implemen- 
tation flexibility, and ease of understanding. 

Sequential Consistency (SC) is an intuitive model that in effect states a program will 
execute as if each operation were completed atomically and one-at-a-time [Lamport 1979]. 
Sequential consistency is easy to reason about but unfortunately prohibits a large number 
of important implementation optimizations. 

For that reason, programming languages for homogeneous CPU systems have started to
converge on memory model specifications that belong to a class called Sequential Consistency
for Data-Race-Free programs (SC for DRF) [Adve and Hill 1990]. These languages,
including Java [Oracle 2014], C, and C++ [ISO. International Organization for Standardization
2011], have chosen the SC for DRF framework because it is easy for most programmers
to understand yet allows many implementation optimizations. SC for DRF specifications
guarantee an SC execution, but only if all concurrent accesses to shared memory are
protected by synchronization such as a mutex. SC for DRF implementations have considerable
flexibility because if, in some data-race-free program, it is impossible to determine the order 
that two memory operations occur (for example, because they are independent), then an 
implementation has the flexibility to reorder those operations. For more on SC for DRF, 
see section 2.1. 

While SC for DRF has proven to be a good framework for traditional CPU systems, it 
has limitations in platforms like mobile SoCs that contain heterogeneous components with 
fine-grained shared memory. In SC for DRF, races are defined under the assumptions that 



Computer Sciences Technical Report 2014-01. 






all synchronization operations (e.g., C++ atomics) complete in a total, global order and 
that synchronization operations have globally-observable side-effects. This assumption is 
reasonable in systems with conventional snooping or directory coherence, but is difficult to 
enforce when, for example, coherence is maintained through heavyweight cache maintenance 
operations (as is done in many GPUs [Hechtman et al. 2014]). For that reason, existing 
heterogeneous platforms provide scoped synchronization operations that have non-global 
side-effects. For example, OpenCL has a barrier operation that only guarantees visibility
among work-items (equivalent to CPU threads for memory model purposes) in the same
work-group (a cluster of work-items sharing physical resources).

Recently, Hower, et al. proposed a class of memory models called Sequential Consistency
for Heterogeneous-Race-Free (SC for HRF) [Hower et al. 2014] that merge SC for DRF
with scoped synchronization operations. They introduced the concept of a heterogeneous
race, which can occur when concurrent accesses are protected with synchronization of
"insufficient" visibility. They identified several options for defining sufficiency and
discussed the usability and implementation trade-offs of each choice.

While the original SC for HRF models do an excellent job of taming the complexity of
scoped synchronization for a simplified system model, real heterogeneous languages must
deal with additional complications that can make it hard to apply the insights of SC for HRF.
For example, OpenCL, which at its core is also a race-free memory model, supports well-defined
but non-SC executions, disjoint address spaces, and limited observability of some memory
locations. In contrast, SC for HRF assumes that all well-defined executions are sequentially
consistent and that the system model has a single, flat address space.

In this paper we show how to add four new features to the HRF framework that together 
allow us to fully specify a real model like OpenCL. We also show how to restrict a program 
so that users of systems with complex models can revert to the original, and far simpler, 
SC for HRF models. 

In particular, we show how to add scope inclusion, relaxed atomics, observability, and 
multiple address spaces to the HRF framework. Scope inclusion permits more well-defined 
programs at essentially zero implementation cost. It was a property known in the origi- 
nal SC for HRF work but excluded due to the complexity it adds to the formalization. 
Relaxed atomics permit well-defined but non-SC executions. They exist in languages like 
OpenCL and C++, and are exceedingly difficult to understand. Observability is necessary
because systems like OpenCL allow what are called coarse-grain allocations, in which a
location mapped into a global address space may only be observable by a subset of the agents
(e.g., to represent a buffer allocated into memory that is not coherent at the full-system
level). Finally, we add multiple address spaces to account for the fact that some locations in
heterogeneous systems exist in an entirely separate address space. For example, in OpenCL,
the local memory region that represents a scratchpad cache is an entirely different address
space from the global memory region that represents coherent caches. However, OpenCL
has operations that can synchronize the two address spaces.

In summary, we make the following contributions to the state of the art: 

- HRF-*-relaxed: We extend the definition of heterogeneous-race-free to support the
  complications of a real heterogeneous platform. This includes support for a property
  called scope inclusion, relaxed synchronization that could result in non-SC executions, a
  notion of location observability, and multiple address spaces (Section 4).
- Equivalence to SC for HRF: We show how to constrain programs so that they result
  in an SC execution. This allows the majority of users to ignore the significant complexity
  discussed in this paper and instead reason in terms of SC for HRF.
- Describe and Clarify OpenCL: We describe the OpenCL memory model using the
  HRF framework. In doing so, we also clarify several features of the OpenCL 2.0 memory
  model that are handled informally in the specification.
Thread t1:
  101: X = 1;
  102: Y = 2;
  103: atomic_store(&F, 100);             // synchronization store
Thread t2:
  201: while (atomic_load(&F) != 100);    // synchronization load
  202: R1 = X;
  203: R2 = Y;

Fig. 1. Independent operations 101 and 102 can be reordered in SC for DRF. Initially X = Y = F = 0.

- Propose OpenCL Extensions: After describing OpenCL in terms of HRF, we use
  the HRF insights to demonstrate how the OpenCL memory model could permit more
  well-defined applications without introducing extra burdens on an implementation.

To our knowledge this work is the first to extend Heterogeneous-Race-Free to handle 
the complexities of an industrial heterogeneous memory model in general and the first to 
provide an alternative formalization of OpenCL 2.0's memory model in particular. 

2. BACKGROUND 

In this section we summarize the work on data-race-free and heterogeneous-race-free mem- 
ory models, as well as some of the concepts implemented in existing heterogeneous memory 
models, particularly OpenCL. 

2.1. Data-Race-Free Memory Models

In 1990, Adve and Hill [Adve and Hill 1990] defined a class of memory consistency models
collectively termed Sequential Consistency for Data-Race-Free (SC for DRF). These models
enable high-performance implementations, have precise semantics, and are relatively simple 
to understand. SC for DRF models describe rules that programs must follow in order to 
avoid data races. In the absence of races, an SC for DRF implementation will guarantee an 
SC execution. In Adve and Hill's original formulation, the system provides no guarantees 
when a program contains a data race, though subsequent work has developed specifications 
that still provide basic ordering guarantees such as write causality [Oracle 2014] or that 
raise an exception when a non-SC execution occurs [Marino et al. 2010] [Lucia et al. 2010]. 

Informally, a data race is usually understood to mean any two ordinary (for C++ that
means non-atomic) memory accesses that are unprotected by synchronization and could
therefore occur "at the same time" [Boehm and Adve 2008]. An SC for DRF model
guarantees that any execution of a data-race-free program will appear sequentially consistent,
though it is under no obligation to ensure that the actual memory access completion order, if
it could be observed, is sequentially consistent.

Many memory operations in a program are independent from each other, and an SC
for DRF implementation is free to reorder any such independent accesses. In Figure 1 an
implementation can perform the accesses on lines 101 and 102 in any order as long as they
complete before the accesses on lines 202 and 203, respectively. In the example, it would
be impossible for thread t2 to determine the order in which 101 and 102 complete without
introducing a data race, and therefore there is no valid execution that could determine
the global completion order of accesses 101 and 102. Rather, all we can say is that line
101 "happens before" line 202 and that line 102 "happens before" line 203. This example
highlights the fact that SC for DRF is a relaxed model: implementations must maintain the
appearance of sequential consistency but are not obligated to produce an actual sequentially
consistent order of memory accesses.
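
The Figure 1 pattern can be written as a small runnable program. The following is only a sketch, assuming C++11 `std::atomic` and `std::thread`; the function name `run_figure1` is ours and does not appear in the paper. The default (sequentially consistent) atomic store and load play the role of the synchronization operations on lines 103 and 201.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical C++11 rendering of Figure 1; variable names mirror the figure.
int X = 0, Y = 0;          // ordinary (non-atomic) locations
std::atomic<int> F{0};     // synchronization variable

// Runs threads t1 and t2 once and returns the values (R1, R2) read by t2.
std::pair<int, int> run_figure1() {
    X = 0;
    Y = 0;
    F.store(0);
    int R1 = 0, R2 = 0;

    std::thread t1([] {
        X = 1;         // 101: independent of 102, so the two may be reordered
        Y = 2;         // 102
        F.store(100);  // 103: synchronization store (seq_cst by default)
    });
    std::thread t2([&] {
        while (F.load() != 100) { }  // 201: synchronization load
        R1 = X;        // 202: 101 happens before this read
        R2 = Y;        // 203: 102 happens before this read
    });

    t1.join();
    t2.join();
    return {R1, R2};
}
```

Because the program is data-race-free, every call returns the pair (1, 2), whichever of the two stores the implementation chooses to complete first.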



Fig. 2. OpenCL execution hierarchy and example mapping to a GPU cache organization. Consistency may
be maintained at programmer-defined synchronization points by flushing/invalidating caches to the visibility
specified by the scope of the operation.



2.2. Heterogeneous Execution Models 

To understand why conventional SC for DRF memory models are insufficient for hetero- 
geneous platforms, we need a basic understanding of heterogeneous execution and memory 
models. In this section, we describe a heterogeneous platform using OpenCL terminology, 
though other platforms like HSA [HSA Foundation 2012] and CUDA [NVIDIA Corporation 
2013] have similar organizations. 

OpenCL exposes a hierarchical execution environment to reflect the fact that some actors 
in a platform have a special relationship to others. An OpenCL application is composed of 
a number of concurrent actors, including host threads that execute on a conventional CPU 
and device work-items that can execute on a variety of attached devices from GPUs to 
FPGAs. As shown in Figure 2a, work-items belong to several groups that capture locality 
and visibility relations with respect to other actors in the system. First, all host threads and 
device work-items belong to the single system group. All work-items in a single NDRange 
execute on the same device. The device itself may comprise several Compute Units (similar 
to a CPU core). Work-items belonging to the same work-group execute together on a single
Compute Unit. Finally, work-items in a sub-group may execute together as part of a SIMD 
vector (for example, when running on a GPU). Separate sub-groups must execute indepen- 
dently: in effect a sub-group is an abstraction of a hardware thread. It is useful to expose 
these groups at the language level because locality is critical in most heterogeneous devices. 
For example, to achieve good performance on a GPU, an OpenCL application should ensure 
that there is little divergence, either in control flow or memory accesses, among work-items 
in a sub-group (as these may execute in lock-step in a hardware vector unit). 

2.3. Scopes 

Since their inception, heterogeneous platforms like OpenCL have incorporated the execu- 
tion hierarchy into the memory model to reflect the fact that communication costs vary 
depending on the actors involved. For example, in a typical GPU implementation, shown
in Figure 2b, that synchronizes by using heavyweight cache maintenance operations like
flush/invalidate [Hechtman et al. 2014], it is much less costly to synchronize among
work-items in a work-group that share an L1 cache than it is among work-items in an NDRange
that only share an L2 cache, and both are more efficient than synchronizing with the host
through an L3 cache or DRAM.

To keep the discussion concrete, we describe the OpenCL 2.0 notion of scopes here. Scope 
definitions are similar in other languages. Informally, a scope is a subset of actors (e.g., work- 
items in a single work-group), and a scoped synchronization operation only affects other 
actors within the same subset. In OpenCL, each atomic operation or fence is performed 
with respect to a single scope that can be specified by a programmer. OpenCL 2.0 defines 
five scopes that correspond to the execution hierarchy shown in Figure 2a: 

(1) memory_scope_work_item
(2) memory_scope_sub_group
(3) memory_scope_work_group
(4) memory_scope_device
(5) memory_scope_all_svm_devices

For simplicity, in the rest of the paper we often abbreviate a scope name as ms_wi, ms_sg,
ms_wg, ms_dev, and ms_svm, respectively. The ms_svm scope corresponds to shared virtual
memory, and includes all actors (work-items and host threads) in an OpenCL execution¹.

When dealing with scopes, it is useful to distinguish the static name for a scope from the 
dynamic group of actors that correspond to a scope at runtime. 

Definition 2.1. Static scope: The scope named in the program
text. For example, the static scope of the OpenCL atomic operation
atomic_load_explicit(&A, ..., memory_scope_work_group) is work-group, or ms_wg.

Definition 2.2. Dynamic scope: The set of agents in the hierarchy at a
given scope. For example, the dynamic scope of the OpenCL atomic operation
atomic_load_explicit(&A, ..., memory_scope_work_group), when executed by work-item
WI, is the set of work-items {WI, WI', ...} that execute together in a particular
work-group WG.

Given two OpenCL atomic operations, it is possible for them to share the same static 
scope but have different dynamic scope, e.g., two work-items in different work-groups each 
executing an atomic with static scope work-group. Two operations with identical dynamic 
scope will always have the same static scope. 

We say that two static scopes are equivalent, written $S_1 ==_{sms} S_2$, if their syntactic
scopes are the same. We say two dynamic scopes are equivalent, written $S_1 ==_{dms} S_2$, if
they correspond to exactly the same set of actors.
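
The distinction between static and dynamic scope can be illustrated with a toy model. This is a sketch under our own assumptions, not OpenCL API: actors are plain integer ids, and the types `StaticScope`, `DynamicScope`, and `ScopedOp` and the two predicates are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <set>

// Hypothetical model: the five scope names from the OpenCL 2.0 list.
enum class StaticScope { work_item, sub_group, work_group, device, all_svm_devices };

// Hypothetical model: a dynamic scope is the set of actor ids it covers at runtime.
using DynamicScope = std::set<int>;

struct ScopedOp {
    StaticScope  static_scope;   // the scope named in the program text
    DynamicScope dynamic_scope;  // the actors that scope denotes for the executing actor
};

// Static-scope equivalence: the syntactic scopes are the same.
bool same_static(const ScopedOp& a, const ScopedOp& b) {
    return a.static_scope == b.static_scope;
}

// Dynamic-scope equivalence: exactly the same set of actors.
bool same_dynamic(const ScopedOp& a, const ScopedOp& b) {
    return a.dynamic_scope == b.dynamic_scope;
}
```

For two work-items in different work-groups, say actors {0, 1} and {2, 3}, two atomics that both name work-group scope satisfy `same_static` but not `same_dynamic`, which is the case described above.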

2.4. Heterogeneous-Race-Free Memory Models

When a program contains synchronization operations that are performed with respect to
different scopes, it introduces the possibility of what Hower et al. [Hower et al. 2014] call a
heterogeneous race. A heterogeneous race occurs when a program correctly protects two
ordinary accesses with synchronizing atomics such that they cannot occur simultaneously (that
is, they are intuitively data-race-free) but fails to use sufficient dynamic scope to guarantee
visibility. They showed that a heterogeneous race, like a normal race, can lead to undefined
behavior in a heterogeneous system. To help describe the types of heterogeneous races that
can occur, they introduced the class of memory models called Sequential Consistency for
Heterogeneous-Race-Free (SC for HRF).

Hower, et al. have defined two alternative SC for HRF memory models that trade off 
implementation flexibility and program expressiveness. The first, called HRF-direct, requires 
that both the source (e.g., producer) and destination (e.g., consumer) actors synchronize 



¹ With some caveats to allow support for devices without coherent memory, as we will see in Section ??.
Device D1
  Work-group X
    Work-item X1
      101: T = 1;
      102: A.store(1, memory_scope_work_group);
    Work-item X2
      201: while (!A.load(memory_scope_work_group));
      202: R2 = T;
      203: B.store(1, memory_scope_device);
Device D2
  Work-group Y
    Work-item Y1
      301: while (!B.load(memory_scope_device));
      302: R3 = T;

Fig. 3. This program contains a race (between lines 101 and 302) in HRF-direct, but is
heterogeneous-race-free in HRF-indirect. Note that we use simplified OpenCL syntax in this example
for consistency throughout the paper. T, A and B are all initialized with the value 0.



with respect to the same dynamic scope whenever communicating through shared memory. 
HRF-direct allows aggressive system optimizations and appears to be a safe abstraction 
of many existing heterogeneous systems. Alternatively, the authors also proposed HRF- 
indirect that allows two communicating actors to synchronize with inexact dynamic scope 
when there is a transitive chain of synchronization through a third actor. HRF-indirect 
reduces the allowable hardware optimizations but may permit higher-performing software 
on today's heterogeneous implementations. 

In Figure 3 we show an example of a program that contains a race in HRF-direct but is
heterogeneous-race-free in HRF-indirect. In HRF-direct, the store at line 101 races with the
load at line 302 because work-items X1 and Y1 have not synchronized in the same scope.
In HRF-indirect, the program is heterogeneous-race-free because work-item X2 forms a
transitive synchronization link between work-items X1 and Y1.

To implement HRF-indirect, a system must ensure that the visible side-effects of syn- 
chronization operations include prior accesses from all actors in the scope of the operation, 
not just from the actor performing the synchronization. To the best of our knowledge, all 
current CPU/GPU heterogeneous systems meet this requirement, but the restriction may 
prevent future optimizations. For example, in a GPU that synchronizes by flushing dirty 
data from a cache (e.g., at line 203 in Figure 3), HRF-indirect would prevent an optimiza- 
tion where only the data modified by the actor requesting synchronization is made visible 
to other actors. More details on the differences between the two models are discussed in 
Hower, et al. [Hower et al. 2014]. 
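
The difference between the two models on Figure 3 can be sketched as a reachability question. The code below is a deliberately simplified model of our own, not the formal definition: it assumes that HRF-direct orders two accesses only through a chain of program order plus synchronization in one dynamic scope at a time, while HRF-indirect allows the chain to mix scopes. All type and function names are hypothetical, and scope ids 0 and 1 simply label the two synchronizing pairs in the figure.

```cpp
#include <cassert>
#include <vector>

constexpr int kProgramOrder = -1;  // edge label for program order
constexpr int kAnyScope = -2;      // search filter: accept every sync edge

struct Edge {
    int from, to;
    int scope;  // kProgramOrder, or an id naming the dynamic scope of a sync pair
};

// Depth-first search over program order plus the sync edges allowed by `filter`.
bool reachable(int n, const std::vector<Edge>& edges, int src, int dst, int filter) {
    std::vector<bool> seen(n, false);
    std::vector<int> stack{src};
    seen[src] = true;
    while (!stack.empty()) {
        int u = stack.back();
        stack.pop_back();
        if (u == dst) return true;
        for (const Edge& e : edges) {
            bool usable = e.scope == kProgramOrder || filter == kAnyScope || e.scope == filter;
            if (e.from == u && usable && !seen[e.to]) {
                seen[e.to] = true;
                stack.push_back(e.to);
            }
        }
    }
    return false;
}

// HRF-direct, as modeled here: the chain may use sync edges of one scope only.
bool ordered_direct(int n, const std::vector<Edge>& edges, int num_scopes, int src, int dst) {
    for (int s = 0; s < num_scopes; ++s)
        if (reachable(n, edges, src, dst, s)) return true;
    return false;
}

// HRF-indirect, as modeled here: the chain may mix sync edges of any scope.
bool ordered_indirect(int n, const std::vector<Edge>& edges, int src, int dst) {
    return reachable(n, edges, src, dst, kAnyScope);
}

// Figure 3 events: 0=101, 1=102, 2=201, 3=202, 4=203, 5=301, 6=302.
// Scope 0 is work-group X; scope 1 is the scope used by lines 203 and 301.
std::vector<Edge> figure3_edges() {
    return {
        {0, 1, kProgramOrder}, {2, 3, kProgramOrder}, {3, 4, kProgramOrder},
        {5, 6, kProgramOrder},
        {1, 2, 0},  // 102 -> 201
        {4, 5, 1},  // 203 -> 301
    };
}
```

With these edges, `ordered_direct(7, figure3_edges(), 2, 0, 6)` is false (lines 101 and 302 race), while `ordered_indirect(7, figure3_edges(), 0, 6)` is true because the chain passes through work-item X2.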



3. NEW FEATURES 

In this section we describe the four HRF feature additions that are required to describe the 
complexities of industrial models. The first, called scope inclusion, allows synchronization 
operations performed with different scopes to directly pair without forming a heterogeneous 
race. The second is support for relaxed, non-SC atomics that are available in both C/C++ 
and OpenCL 2.0. The third adds the ability to describe coarse-grain memory regions with 
limited observability, and the fourth adds support for multiple address spaces in the same 
platform image. In the remainder of this section we discuss the motivation and concepts 
behind each feature. 
Device D1
  Work-group X
    Work-item X1
      101: T = 1;
      102: A.store(1, memory_scope_device);
    Work-item X2
      201: while (!A.load(..., memory_scope_work_group));
      202: R2 = T;

Fig. 4. Example of correct synchronization in a model with scope inclusion support. Work-items X1 and
X2 can directly synchronize with each other using different scopes because one scope is a subset of the
other. T and the atomic variable A are both initialized to 0.

Device D1
  Work-group X
    Work-item X1
      101: A.store(1, memory_scope_work_group);
      102: R1 = B.load(..., memory_scope_device);
    Work-item X2
      201: B.store(1, memory_scope_device);
      202: R2 = A.load(..., memory_scope_work_group);

Fig. 5. A race-free program in both HRF-direct and HRF-indirect. An implementation must ensure there is
a total observable order of all operations, even though they are performed with respect to different scopes.
Atomic variables A and B are both initialized to 0.

3.1. Scope Inclusion 

Figure 4 shows an example of an OpenCL-like program that synchronizes using operations
performed with respect to different scopes. The dynamic scope of the operation on line
102 (device D) includes the dynamic scope of the operation on line 201 (work-group X).
Intuitively, one might expect this example to work as intended, such that the value of R2
is guaranteed to be 1, because the release to device scope D is in effect a communication
to all of the actors contained in the work-group scope X, plus more. However, Figure 4 is
a race in both HRF-direct and HRF-indirect.

Definition 3.1. Scope inclusion: For now, let us say that two scoped synchronization
operations, $O_S$ and $O'_{S'}$, have dynamic scopes $S$ and $S'$ respectively. The operations are
inclusive, written $O_S \sim_{incl} O'_{S'}$, if either the dynamic scope $S$ of $O_S$ is a subset of the
dynamic scope $S'$ of $O'_{S'}$, or vice versa.
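
Read as a predicate over sets of actors, Definition 3.1 is a two-way subset test. A minimal sketch follows, assuming dynamic scopes are represented as sets of integer actor ids; the names `DynamicScope`, `subset_of`, and `inclusive` are ours, introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

// Hypothetical representation: the actor ids covered by an operation's dynamic scope.
using DynamicScope = std::set<int>;

// True when every actor in `a` is also in `b`. std::includes needs sorted
// ranges, which std::set iteration provides.
bool subset_of(const DynamicScope& a, const DynamicScope& b) {
    return std::includes(b.begin(), b.end(), a.begin(), a.end());
}

// Two scoped operations are inclusive if either dynamic scope contains the other.
bool inclusive(const DynamicScope& s, const DynamicScope& s_prime) {
    return subset_of(s, s_prime) || subset_of(s_prime, s);
}
```

For example, with work-group X as {0, 1} and its device as {0, 1, 2, 3}, `inclusive` holds for the pair, as in Figure 4; it does not hold for two disjoint work-groups such as {0, 1} and {2, 3}.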

As Hower, et al. have previously observed, reasonable implementations of either HRF-direct
or HRF-indirect will likely ensure that Figure 4 works as expected, and thus we expect the
implementation cost of scope inclusion is negligible. To see why, consider the example in
Figure 5, which is race-free in both HRF-direct and HRF-indirect. The program is race-free,
so an implementation must guarantee a sequentially consistent execution: a total order of all
operations. Notably, this means that an implementation must establish an order between the
atomic loads and stores even though they are performed with respect to different, inclusive,
scopes. It would be exceptionally difficult for an implementation to dynamically distinguish
between the programs in Figure 4 and Figure 5, leading us to believe the
implementation cost of scope inclusion is negligible.

Aside from being trivial to implement, scope inclusion could also lead to more composable
functions. Without scope inclusion, a library function cannot in general concurrently modify
a data structure because the callers may use a different scope of synchronization on the
same data structure. With scope inclusion, a library can safely use the largest possible
scope (usually ms_svm) regardless of how the callers synchronize.
001: struct Task;
002: struct MsgQueue {
003:   int _occupancy;
004:
005:   Task* dequeue() {
006:     if (atomic_load(&_occupancy) == 0) {
007:       return NULL;
008:     } else { ... }
009:   }
010:
011: } globalQueue;

Thread t1:
101: void periodicCheck() {
102:   Task* t = globalQueue.dequeue();
103:   if (t != NULL)
104:     t->execute();
105: }

Fig. 6. A C++ program where SC for DRF unnecessarily prohibits some operation reordering.

3.2. Relaxed Atomics 

The implementation flexibility afforded by SC for DRF (and discussed in Section 2.1), and
by extension SC for HRF, is good enough for the majority of programs and implementations.
However, the pure models do have constraints that can unnecessarily degrade performance
in specific cases. For that reason, some derivative languages like C/C++11 and OpenCL
2.0 support relaxation of the sequential consistency requirement in limited cases while
remaining, at their core, data-race-free models. These relaxations are exceptionally difficult
to understand, and are intended to be used rarely and only by expert programmers [Boehm and
Adve 2008].

We show an example of when SC for DRF/HRF can be overly restrictive in Figure 6. In 
this example, a service thread periodically checks whether a client thread has requested a 
service by reading from an incoming message queue. If there are no messages, the service 
thread continues to do other, unrelated work involving only local data. Let's say that re- 
sponse time for the incoming request is critical, meaning that the periodic check must be 
frequent, but that requests are rare. In the common case the queue is empty, and for that 
reason a high-performance implementation might wish to avoid the overhead of synchroniza- 
tion (e.g., a low-level fence instruction) when checking the queue for an incoming request. 
In a strict SC for DRF/HRF model, there is no way to check the occupancy of the shared
queue without using synchronization; any attempt to read the state of the queue using an
ordinary load would form a data race with the requestor. Thus, the program may
be unnecessarily slow on a system with high synchronization costs.

To support higher-performance programs, both C/C++11 and OpenCL 2.0 include what
they call low-level atomics, which are atomic operations explicitly marked with an ordering
property weaker than sequential consistency. Specifically, programmers can mark an atomic
access as a release, an acquire, or a relaxed operation². An access marked as a release
or an acquire has global ordering side-effects on ordinary loads and stores, but the atomic
access itself does not have to be sequentially consistent relative to other atomic accesses.
This is similar to how synchronization operations are treated in the well-known RCpc
model [Gharachorloo et al. 1990]. Accesses with relaxed ordering, on the other hand, have
no side-effects on ordinary loads and stores and can likewise be reordered relative to other
non-SC accesses.
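
The three orderings can be seen side by side in a short C++11 program. This is a sketch using the standard `std::memory_order` values; the function name `run_release_acquire` and its variables are hypothetical. The release store and acquire load order the ordinary write and read of `payload`, while the relaxed increment of `hits` is atomic but orders nothing else.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value seen by the consumer thread.
int run_release_acquire() {
    int payload = 0;                 // ordinary location
    std::atomic<bool> ready{false};  // flag used for release/acquire pairing
    std::atomic<int> hits{0};        // counter updated with relaxed ordering
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                   // ordinary store
        hits.fetch_add(1, std::memory_order_relaxed);   // no ordering side-effects
        ready.store(true, std::memory_order_release);   // publishes prior stores
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) { }  // pairs with the release
        seen = payload;                                  // ordered after payload = 42
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The consumer always observes 42 because the acquire load that reads `true` synchronizes with the release store; nothing in the example relies on the relaxed counter for ordering.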



² For simplicity, we treat consume ordering like release order in this paper.
001: struct Task;
002: struct MsgQueue {
003:   int _occupancy;
004:
005:   Task* dequeue() {
006:     if (atomic_load(&_occupancy) == 0) {
007:       return NULL;
008:     } else { ... }
009:   }
010:
011:   int occupancy() {
012:     return atomic_load_explicit(&_occupancy, memory_order_relaxed);
013:   }
014: } globalQueue;

Thread t1:
101: void periodicCheck() {
102:   // goal: avoid global synchronization on occupancy check
103:   if (globalQueue.occupancy() > 0) {
104:     Task* t = globalQueue.dequeue();
105:     if (t != NULL)
106:       t->execute();
107:   }
108: }

Fig. 7. An example showing how relaxed atomics can lead to better performing programs.

Using relaxed atomics, the service thread in Figure 6 can avoid costly synchronization 
in the common case, as shown in Figure 7. In this example, the service thread reads the 
occupancy of the shared queue using a relaxed atomic that will not produce any synchroniza- 
tion side-effects. The thread will only perform costly synchronization (through the dequeue 
function on line 104) if it finds the queue has content. In this example, the implementation 
is under no obligation to ensure that the occupancy check is sequentially consistent with 
respect to the rest of the execution. 
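
A compilable variant of the Figure 7 idea is sketched below. It is not the paper's code: for brevity it protects the queue contents with a `std::mutex` instead of the atomics elided in the figure, and the names `enqueue`, `periodic_check`, and the `Task::id` field are our own additions. The point it illustrates is the same: the relaxed occupancy read is only a cheap hint, and the synchronizing path is taken only when the hint says the queue is non-empty.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>

struct Task { int id; };  // hypothetical payload type

class MsgQueue {
public:
    void enqueue(Task* t) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(t);
        occupancy_.fetch_add(1, std::memory_order_relaxed);
    }

    // Synchronizing path: the mutex orders access to the queue contents.
    Task* dequeue() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return nullptr;
        Task* t = tasks_.front();
        tasks_.pop_front();
        occupancy_.fetch_sub(1, std::memory_order_relaxed);
        return t;
    }

    // Cheap hint: a relaxed load with no synchronization side-effects.
    int occupancy() const {
        return occupancy_.load(std::memory_order_relaxed);
    }

private:
    std::mutex mutex_;
    std::deque<Task*> tasks_;
    std::atomic<int> occupancy_{0};
};

// Returns the dequeued task, or nullptr when the hint reports an empty queue.
Task* periodic_check(MsgQueue& q) {
    if (q.occupancy() > 0) return q.dequeue();
    return nullptr;
}
```

A stale hint is harmless in this design: if it under-reports, the task is picked up on a later check, and if it over-reports, `dequeue` simply returns `nullptr` after taking the lock.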

While relaxed atomics can be useful in limited circumstances, we reiterate that relaxed
atomics are quite complex and error prone. Their inclusion in C/C++ was controversial,
and their use is generally discouraged [Boehm and Adve 2008; Sutter 2012]. The pitfalls
of relaxed atomic complexity are not limited to C++ users and implementers: the C++11
standard [ISO. International Organization for Standardization 2011] has a known issue in
its formulation of relaxed atomics. Although the intent was clear to readers, in an
effort to avoid out-of-thin-air values the committee inadvertently added text that required
relaxed atomics to behave as if they were SC. This was a far stronger change than was
intended [Boehm and Demsky 2014].

Despite the challenges and shortcomings of relaxed atomics [Adve 2010], we address
them here because the OpenCL standards committee has included them in the OpenCL 2.0
specification.

3.3. Observability 

Heterogeneous models separate memory into regions that are shared between discrete de- 
vices at coarse-grain synchronization points. For example, OpenCL provides coarse-grain 
buffers that are allocated with API calls and that only become visible at coarse-grain syn- 
chronization points that map or unmap the region. Notably, the visibility of memory in a 
coarse-grain buffer is not generally affected by fine-grain synchronization such as atomics. 

Coarse-grain regions complicate an HRF model because they set a strict limit on the
observability of a location within the region. For example, in OpenCL, a coarse-grain buffer
allocated to a particular device will never be visible to an agent outside of that device, even
if a work-item on the device synchronizes globally. OpenCL has this feature in order to
allow the runtime to use any device-specific physical memory, such as non-coherent GPU
DRAM.

We incorporate observability into an HRF model in Section 5.

3.4. Multiple Address Spaces

In addition to coarse-grain regions within a single address space, heterogeneous platforms 
may also provide some memory regions with an entirely different address space. And, to 
complicate things further, these address spaces may have different ordering rules. In the 
case of OpenCL, the local and global address space orders are almost entirely separate, as 
is evident from the separate local and global flags that can be passed to OpenCL fences. 

Address spaces play a fundamental role in describing data locality. They allow developers
to explicitly manage where data lives in the memory hierarchy during program execution.
For example, OpenCL's local memory generally corresponds to a physical scratchpad memory,
which is why it is modeled as an address space that is independent of coherent shared
memory and visible only to a subset of agents. Because synchronization on local
memory does not affect global memory, implementations can keep local memory operations
fast (e.g., an implementation does not need to flush caches on a local memory release).

We show how to incorporate multiple address spaces into an HRF model in Section 5.1. 

4. HETEROGENEOUS-RACE-FREE-RELAXED (HRF-*-RELAXED) 

In this section we formalize the extensions described in Section 3 into a fully-specified HRF
model. Here we assume the basic notion of scope inclusion described in Definition 3.1,
though we show in Section 6.3 that the model can also support more restricted rules for
scope inclusion, such as those in OpenCL 2.0.

Like Hower et al. [Hower et al. 2014] before us, we propose two versions of our relaxed
HRF model that differ only in whether or not they support transitive synchronization
involving different scopes. We present the non-transitive variant, called
HRF-direct-relaxed, first and then show the change necessary to support scope
transitivity in Section 4.6.

The formal definition of HRF-direct-relaxed in Figure 9 is considerably more complex
than its predecessor HRF-direct, due mostly to the fact that it does not start with the
same a-priori assumption that all candidate executions are sequentially consistent. Luckily,
if a program only uses mo_sc atomics (the default in OpenCL) and only synchronizes with
pairs of atomics using the exact same scope (the only synchronization currently defined by
OpenCL), then the HRF-direct-relaxed model is equivalent to the simpler HRF-direct
model developed by Hower et al. [Hower et al. 2014]. Thus, the majority of users do not
need to concern themselves with the complexities of HRF-direct-relaxed. A proof of the
equivalence of the two models is given in the Appendix.

4.1. Model Structure 

System Model We define the model for an abstract system consisting of a collection of
disjoint memory locations. Loads (stores) read (write) a value from (to) a single location.
For simplicity, assume for now that all loads and stores are aligned to their natural width
and that there are no overlapping loads or stores of different widths. Some loads and stores
are marked as atomic, and all atomic operations are qualified with a specific ordering (e.g.,
mo_sc) and scope (e.g., ms_wg). A thread of execution is a set of operations, including
loads and stores, performed by a single agent (host thread or work-item). Given the values
returned from memory, a thread of execution must respect the control flow semantics of the
static program. We call the order of operations that a thread of execution performs program
order and assume that it is a total order (see footnote 3).

[Figure 8 annotations: "If this set is not empty, the execution of 'P' is undefined." and
"Actual execution of 'P' will come from this set."]

Fig. 8. Logical structure of the HRF relaxed models

Logical Flow In Figure 8 we show how to use an HRF relaxed model to determine the
possible executions of a program. Given a static program, a user first constructs the set
of all plausible executions that result when each load observes either the initial value of
a location or the value of some store in the execution to the same location. Many
plausible executions will eventually be discarded from consideration because they violate
the rules of the model. Next, the set of plausible executions is reduced to a set of candidate
executions that respect the apparent orders and rules of the HRF model, e.g., those listed
in Section IV of Figure 9. A program will result in an undefined execution if any candidate
execution contains a heterogeneous race, e.g., as defined in Section V of Figure 9. Otherwise,
a user can precisely determine the set of possible executions:

If all candidate executions are race-free, then a conforming implementation 
must produce one of the candidate executions. 
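
The logical flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `possible_executions` and the two predicate parameters are hypothetical, and an execution is left as an opaque value whose structure the caller chooses.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

E = TypeVar("E")  # an execution, in whatever representation the caller uses


def possible_executions(
    plausible: Iterable[E],
    respects_apparent_orders: Callable[[E], bool],
    has_heterogeneous_race: Callable[[E], bool],
) -> Optional[List[E]]:
    """Return the candidate executions, or None if the program is undefined."""
    # Keep only the plausible executions that obey the Section IV rules.
    candidates = [e for e in plausible if respects_apparent_orders(e)]
    # A single racy candidate execution makes the whole program undefined.
    if any(has_heterogeneous_race(e) for e in candidates):
        return None
    # Otherwise a conforming implementation must produce one of these.
    return candidates


# Toy usage: executions are dicts carrying precomputed flags (hypothetical).
toy = [{"ok": True, "race": False}, {"ok": False, "race": True}]
result = possible_executions(toy, lambda e: e["ok"], lambda e: e["race"])
```

In the toy usage the second execution is discarded before the race check because it is not a candidate execution, so the program remains well defined.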

Notes The reader should be careful not to interpret the rules regarding candidate
executions in Section IV as rules that are always strictly enforced by an HRF-direct-relaxed
implementation. For example, while it is correct to say that there is an apparent total
order of mo_sc atomic operations, regardless of scope, in the execution of a
heterogeneous-race-free program (the sc order in Figure 9), an implementation is under no
obligation to provide a total order of all mo_sc atomics that are executed by software. If the
executing program contains a heterogeneous race, then the implementation can execute
correctly in a fashion that would break the total order because we do not attempt to define
that execution. Also note that the orders in Section IV are apparent orders, and are not
necessarily a strict indication of the order in which an implementation must complete
operations even if an execution is heterogeneous-race-free. Given the values observed during
an execution, it may be impossible to determine the actual order in which some operations
completed. For example, the operations may be independent. When constructing the apparent
orders for purposes of the model, these independent operations are put in an arbitrary order
relative to one another. For example, if a heterogeneous-race-free execution contains two mo_sc atomic



3 Some languages have an undefined evaluation order in certain situations such that program order may be
a partial order. We omit this complication from the model since it is well-known how to handle it and would
only distract from the main contributions.

[Section I]

Scope Inclusion: We say two atomic operations $O_S[\ell]$ and $O'_{S'}[\ell']$ are inclusive, written
$O_S \sim_{incl} O'_{S'}$, according to a scope-inclusion operator. For now, assume the operator from
Definition 3.1:

$$O_S \sim_{incl} O'_{S'} \equiv (S \subseteq S') \lor (S' \subseteq S)$$

Consistent Orders: We say two orders, $\xrightarrow{a}$ and $\xrightarrow{b}$, are consistent iff there
does not exist any pair of operations $O$ and $O'$ such that $O \xrightarrow{a} O'$ and
$O' \xrightarrow{b} O$. In other words, $(\xrightarrow{a} \cup \xrightarrow{b})^+$ is irreflexive.

II. Conflict Definitions

Ordinary Conflict: Memory actions $O[\ell]$ and $O'[\ell']$ conflict iff $\ell = \ell'$, at least one is a
store, and at least one is ordinary (non-atomic).

Atomic Conflict: Two atomic operations $O_S[\ell]$ and $O'_{S'}[\ell']$ conflict iff $\ell = \ell'$, at
least one is a write or read-modify-write, and $O_S \not\sim_{incl} O'_{S'}$.

III. Executions

Plausible Execution: An execution E of program P is plausible iff:

(1) In E, the value of any load $L[\ell]$ is either the initial value of $\ell$ or the value produced by
    some other store, $S[\ell]$, to the same location in E
(2) as long as $S[\ell]$ is not control, address, or data dependent on $L[\ell]$ (this prevents an
    operation $S[\ell]$ from providing a value before it is known whether or not $S[\ell]$ can exist in E).

Candidate Execution: A candidate execution is any plausible execution that respects the apparent
orders and rules in Section IV below.

IV. Consistent Apparent Orders And Load Values in a Candidate Execution

Program Order ($po_a$): Operations $O$ and $O'$ are in program order, written $O \xrightarrow{po_a} O'$,
iff both are from the same agent $a$ and $O$ comes before $O'$ in the execution control flow. When
referenced without a subscript, $po$ refers to $\bigcup po_a$ for all agents $A$.

Sequentially Consistent Atomic Order ($sc$): There is an apparent total order, $sc$, of all the
memory_order_seq_cst atomic operations. $sc$ must be consistent with $po$.

Coherent Order ($coh_\ell$): There is an apparent total order, $coh_\ell$, of all accesses by all actors
to any single location $\ell$. $coh_\ell$ must be consistent with $sc$ and $po$ for all $\ell$. The read and
write components of an atomic read-modify-write must be adjacent in $coh_\ell$. When referenced without
a subscript, $coh$ refers to $\bigcup_{\ell \in L} coh_\ell$ for all locations $L$.

Scoped Synchronization Order ($so_a$): Given a release memory action, $Rel_S[\ell]$, and an acquire
memory action, $Acq_{S'}[\ell]$, $Rel_S[\ell] \xrightarrow{so_a} Acq_{S'}[\ell]$ iff
$Rel_S[\ell] \sim_{incl} Acq_{S'}[\ell]$, $Rel_S[\ell] \xrightarrow{coh_\ell} Acq_{S'}[\ell]$, and
$S$ and $S'$ both include agent $a$.

Heterogeneous-Happens-Before-Direct-Relaxed ($hhb.dr$): The union of the irreflexive transitive
closures of all scope synchronization orders with program order:

$$\bigcup_{a \in A} \left( (po \cup so_a)^+ \right)$$

Further, $hhb.dr$ cannot contain a cycle and is consistent with $coh$.

Value of a Load: A load $L[\ell]$ observes the value produced by the most recent store $S[\ell]$ in
$coh_\ell$:

$$valueof(L[\ell]) = valueof(S[\ell]) : (S[\ell] \xrightarrow{coh_\ell} L[\ell]) \land \nexists S'[\ell] : S[\ell] \xrightarrow{coh_\ell} S'[\ell] \xrightarrow{coh_\ell} L[\ell]$$

Fig. 9. Part 1 of the formalization of HRF-direct-relaxed.

[Section V]

Heterogeneous Race: A candidate execution contains a heterogeneous race iff two conflicting
(ordinary or atomic) actions $O$ and $O'$ are unordered in $hhb.dr$:

$$\neg (O \xrightarrow{hhb.dr} O' \lor O' \xrightarrow{hhb.dr} O)$$

Heterogeneous-race-free Program: A program is heterogeneous-race-free iff all of its candidate
executions are heterogeneous-race-free.

Racey Program: Any program containing a heterogeneous race is considered racey.

Result of a Heterogeneous-race-free Program: The result of any heterogeneous-race-free
program will be one of its candidate executions.

Result of a Racey Program: The outcome of a racey program is undefined on a conforming
implementation. Notably, this means that the orders defined in Section IV above do not have to
hold.


accesses with the same static work-group scope but different dynamic scope (because they 
are performed in different work-groups), an implementation does not need to ensure that 
those two atomics are serialized because an agent cannot observe the actual completion 
order without introducing a heterogeneous race. 

In Section 4.5 we show how a system can take advantage of these observations to imple- 
ment performance optimizations, especially in relation to scoped operations. 
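
To make the happens-before and race definitions of Figure 9 concrete, the following Python sketch builds hhb.dr as the union over agents of (po ∪ so_a)+ and checks whether a conflicting pair is unordered. It is a minimal illustration, not the formal model: the operation names, the helper names, and the assumption that each so_a is supplied precomputed (inclusive and coherence-ordered pairs only, with one entry per agent) are ours.

```python
from itertools import product


def transitive_closure(edges):
    """Return the transitive closure of a set of (before, after) pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def hhb_dr(po, so_by_agent):
    """Union over agents a of (po U so_a)+; needs one entry per agent."""
    result = set()
    for so_a in so_by_agent.values():
        result |= transitive_closure(set(po) | set(so_a))
    return result


def has_race(conflicting_pairs, hb):
    """True iff some conflicting pair is unordered in hb."""
    return any((a, b) not in hb and (b, a) not in hb
               for a, b in conflicting_pairs)


# Message passing: w1 stores data then releases; w2 acquires then loads data.
po = {("Wd", "Rel"), ("Acq", "Rd")}
so = {"w1": {("Rel", "Acq")}, "w2": {("Rel", "Acq")}}
conflicts = [("Wd", "Rd")]

synchronized = has_race(conflicts, hhb_dr(po, so))
unsynchronized = has_race(conflicts, hhb_dr(po, {"w1": set(), "w2": set()}))
```

With the release/acquire pair present, the ordinary store and load are ordered and `synchronized` is `False`; dropping the pair leaves them unordered, so `unsynchronized` is `True`.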

4.2. Discussion 

Scope Inclusion To add scope inclusion, an HRF model must exclude inclusive
synchronization from the set of potentially racing operation pairs. In the formalism, this is
handled by the definition of a synchronization conflict (Atomic Conflict in Figure 9). This is a
relatively simple change over HRF-direct; the majority of the complexity in the
HRF-direct-relaxed formalism comes from the support for relaxed atomics.
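
As an illustration of how scope inclusion removes pairs from the set of potential races, the sketch below encodes the inclusion operator of Definition 3.1 and the atomic conflict test. It assumes a dynamic scope instance can be represented as the set of agents it contains; the class `AtomicOp`, its fields, and the agent names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AtomicOp:
    location: str
    is_write: bool          # True for a store or a read-modify-write
    scope: FrozenSet[str]   # agents contained in the dynamic scope instance


def inclusive(a: AtomicOp, b: AtomicOp) -> bool:
    """Scope inclusion: one scope instance contains the other."""
    return a.scope <= b.scope or b.scope <= a.scope


def atomic_conflict(a: AtomicOp, b: AtomicOp) -> bool:
    """Same location, at least one write, and scopes that are not inclusive."""
    return (a.location == b.location
            and (a.is_write or b.is_write)
            and not inclusive(a, b))


WG1 = frozenset({"w1", "w2"})            # one work-group instance
WG2 = frozenset({"w3", "w4"})            # another work-group instance
DEVICE = WG1 | WG2                       # the enclosing device scope

store_wg1 = AtomicOp("X", True, WG1)
load_dev = AtomicOp("X", False, DEVICE)
load_wg2 = AtomicOp("X", False, WG2)
```

Here `store_wg1` and `load_dev` use different but inclusive scopes and therefore do not conflict, whereas `store_wg1` and `load_wg2` use disjoint work-group instances and do.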

Relaxed Atomics A goal of HRF-direct-relaxed is to define non-SC executions, so we
cannot start with the same simplifying assumption made in HRF-direct that all candidate
executions are sequentially consistent. The difference is not unlike the changes made when
moving from DRF0 [Adve and Hill 1990] to C++11 [ISO. International Organization for
Standardization 2011], which are explained in detail by Boehm and Adve [Boehm and
Adve 2008].

Apparent Orders We explicitly define two apparent orders that must exist in a candidate
execution and that are not explicitly defined in HRF-direct (see footnote 4). The first, sc, is a
total order of all atomics using mo_sc order. As we have already discussed, this does not
necessarily mean that an implementation must serialize all mo_sc atomics that it executes. The
second, coh, is a total order of all loads and stores to the same location. This constraint
essentially restricts HRF to systems that support hardware coherence, though that coherence
mechanism does not need to be a conventional read-for-ownership style CPU protocol (e.g., a
MESI protocol), and can instead be something more basic, similar to what modern GPUs implement.
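
The requirement that sc and coh be consistent with po can be checked mechanically. The sketch below uses the second form of the Consistent Orders definition in Figure 9: two orders are consistent when the transitive closure of their union is irreflexive, which is equivalent to their union being acyclic. The function name and the operation labels are illustrative assumptions.

```python
def consistent(order_a, order_b):
    """True iff the union of the two orders contains no cycle."""
    graph = {}
    for src, dst in set(order_a) | set(order_b):
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    visiting, done = set(), set()

    def reaches_cycle(node):
        if node in done:
            return False
        if node in visiting:
            return True          # back edge: the union has a cycle
        visiting.add(node)
        found = any(reaches_cycle(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return not any(reaches_cycle(node) for node in graph)


# Two mo_sc stores s1 and s2 issued by one agent in that program order.
po = {("s1", "s2")}
agrees = consistent(po, {("s1", "s2")})      # sc agrees with po
disagrees = consistent(po, {("s2", "s1")})   # sc contradicts po
```

A candidate execution whose sc order places `s2` before `s1` would be rejected, since that order is not consistent with program order.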



4 However, these orders do in fact still exist in HRF-direct because all candidate executions are sequentially
consistent.

Fig. 9. Part 2 of formalization of HRF-direct-relaxed

Device D1

    Work-group W, work-item W1:
        X.store(1, ms_dev);

    Work-group X, work-item X1:
        Y.store(1, ms_dev);

    Work-group Y, work-item Y1:
        A = X.load(ms_dev);
        B = Y.load(ms_dev);

    Work-group Z, work-item Z1:
        C = Y.load(ms_dev);
        D = X.load(ms_dev);

Fig. 10. Assuming that X and Y are initialized to 0, in HRF-direct-relaxed, an implementation is allowed
to produce the non-SC result A = C = 1 and B = D = 0



The other orders defined in the model, so_a and hhb.dr, are derived from sc and coh.
Scoped synchronization order is defined per-work-item because of scope inclusion. It is not
sufficient to say, for example, that there is an order of synchronization operations within a
single scope (as HRF-direct does) because agents can directly synchronize using different
scopes. It would also be too strong to say there is an order among all synchronization
regardless of scope. Therefore, HRF-direct-relaxed effectively defines an order of
synchronization among any group of work-items that could directly synchronize.

Load Value The value of an atomic load in HRF-direct-relaxed will either be the value
of the most recent ordinary or atomic store in hhb.dr (that is, the most recent store in
a sequentially consistent order) or the value of some atomic store that is unordered with
respect to the load in hhb.dr. For example, HRF-relaxed permits the heterogeneous-race-free
example in Figure 10 to obtain the non-SC result A = C = 1 and B = D = 0 in some
executions. We show in the appendix that when a load observes a value that does not come
from the most recent store in hhb.dr, the value comes from a store that is unordered
with respect to both (a) the load, and (b) the most recent store in happens-before relative
to the load.
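
To see why the result named in Figure 10 is non-SC, one can enumerate every interleaving of the four work-items that respects program order and collect the values the loads return. The Python sketch below does exactly that; it models only sequential consistency, not HRF-direct-relaxed, and the thread labels and tuple encoding of operations are assumptions made for the illustration.

```python
from itertools import permutations

# Each thread is a list of operations in program order (Figure 10).
THREADS = {
    "W1": [("store", "X")],
    "X1": [("store", "Y")],
    "Y1": [("load", "X", "A"), ("load", "Y", "B")],
    "Z1": [("load", "Y", "C"), ("load", "X", "D")],
}


def sc_outcomes(threads):
    """Return every (A, B, C, D) reachable under sequential consistency."""
    # One slot per operation, labeled with the thread that owns it.
    slots = [name for name, ops in threads.items() for _ in ops]
    outcomes = set()
    for schedule in set(permutations(slots)):
        memory = {"X": 0, "Y": 0}
        regs = {}
        next_op = {name: 0 for name in threads}
        for name in schedule:
            op = threads[name][next_op[name]]   # next op in program order
            next_op[name] += 1
            if op[0] == "store":
                memory[op[1]] = 1
            else:
                regs[op[2]] = memory[op[1]]
        outcomes.add((regs["A"], regs["B"], regs["C"], regs["D"]))
    return outcomes


OUTCOMES = sc_outcomes(THREADS)
non_sc_result = (1, 0, 1, 0)   # A = C = 1 and B = D = 0
```

The tuple `non_sc_result` never appears in `OUTCOMES`: it would require work-item Y1 to see the store to X before the store to Y while work-item Z1 sees them in the opposite order, which no single interleaving allows.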



4.3. Sketch of Equivalence to HRF-direct

As expected, with the HRF-relaxed formulation we can guarantee that any
heterogeneous-race-free program which only uses mo_sc atomics will always result in a
sequentially consistent execution (see appendix for formalization). With this property, it is
safe for non-expert users to revert to the simpler HRF-direct model and thereby never have
to reason about the valid but complex non-SC orderings.

At a high level, we prove that HRF-direct-relaxed is equivalent to HRF-direct for any
program that only uses mo_sc atomics and exact-scope synchronization by proving that those
conditions always produce sequentially consistent candidate executions. In a sequentially
consistent candidate execution, the value of the most recent store in coherent order must
also be the most recent value in happens-before (because both respect program order and
happens-before respects coherent order). Because the definitions in Section V of Figure 9
are the same as those in HRF-direct, the two models are equivalent.

4.4. Other Considerations

The HRF model presented here takes inspiration from the C++ formalization of relaxed
atomics, and as such has inherited a known issue in the C++ model relating to so-called "out
of thin air" values [Boehm 2013]. Solutions to the problem have been proposed, though they
are controversial [Boehm and Demsky 2014]. Thus, we do not take a stance in this paper
nor propose any new solutions, instead focusing on the novel aspects of HRF-*-relaxed
relating to heterogeneous systems.

To keep our presentation as simple as possible, we also do not handle other complications
that could arise in a real model. These other complications, such as unaligned
loads and partial program order, have well-known and uncontroversial solutions, and we
therefore assume they could be easily addressed.



4.5. HRF-Direct-Relaxed Base Implementation 

We assume a system organized like the one in Figure 2b. We define mappings c(S) and C(S)
of dynamic scope to physical cache(s) as follows:

c(Sub-group)    nil
c(Work-group)   The local write buffer of the executing agent.
c(Device)       The local write buffer and L1 cache of the executing agent.
c(System)       The local write buffer, L1 cache, and L2 cache of the executing agent.

C(Sub-group)    The local write buffer of the executing agent.
C(Work-group)   The local L1 cache of the executing agent.
C(Device)       The local L2 cache of the executing agent.
C(System)       Main memory (DRAM).
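
The two mappings can be written down as simple lookup tables. The Python sketch below is only a transcription of the lists above; the dictionary names `FLUSH_SET` and `PERFORM_AT`, the helper names, and the string labels for the caches are assumptions introduced for illustration.

```python
# c(S): structures local to the executing agent associated with scope S.
FLUSH_SET = {
    "sub-group": [],
    "work-group": ["write buffer"],
    "device": ["write buffer", "L1"],
    "system": ["write buffer", "L1", "L2"],
}

# C(S): the level of the hierarchy that corresponds to scope S.
PERFORM_AT = {
    "sub-group": "write buffer",
    "work-group": "L1",
    "device": "L2",
    "system": "DRAM",
}


def caches_in_c(scope):
    """Return c(S) as an ordered list, nearest structure first."""
    return list(FLUSH_SET[scope])


def level_for_C(scope):
    """Return C(S), the cache or memory corresponding to scope S."""
    return PERFORM_AT[scope]


device_c = caches_in_c("device")
device_C = level_for_C("device")
```

Each c(S) is a prefix of the next wider scope's list, and C(S) names the first level beyond the structures in c(S).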

We assume the agents execute loads and stores in program order and that the write 
buffer drains in program order. Ordinary requests waiting in input queues for caches may 
be reordered if the requests are to different locations. Ordinary requests to the same location 
cannot reorder, though loads (stores) may coalesce into a single memory system request if 
there is no store (load) in program order between them. Requests cannot reorder around 
any other request that originated from a sequentially consistent atomic. 

Caches maintain valid, dirty, and invalid states. On a flush, all request queues are drained 
and all dirty data is evicted to the next level of cache or main memory. On an invalidate, all 
request queues are drained and all valid data is invalidated. No new request can be serviced 
while a maintenance operation is pending. Caches will only return valid or dirty data to an 
agent. Dirty lines are cleaned periodically by writing back to the next level of cache. A line 
is guaranteed to be cleaned a finite time after the line was written. 

A load completes when the value it will return is read from a cache or memory. A store
completes when the value it produces is written into the local write buffer. A
read-modify-write completes when the store portion completes at the target scope.

The system operates as listed in Figure 11. 



4.6. HRF-Indirect-Relaxed

We define a variant of HRF-*-relaxed that supports transitive synchronization through
scopes similar to HRF-indirect. HRF-indirect-relaxed is identical to HRF-direct-relaxed
in all ways except for a fully transitive happens-before relation shown in Figure 12.
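
The difference between the two happens-before relations can be illustrated with a chain of three synchronizations in which no single agent belongs to every scope instance involved. The Python sketch below is an assumption-laden illustration: scope instances are modeled as sets of agents, the operation and agent names are hypothetical, and every listed release/acquire pair is assumed to already satisfy the inclusion and coherence-order conditions.

```python
from itertools import product


def transitive_closure(edges):
    """Return the transitive closure of a set of (before, after) pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def so_for_agent(agent, sync_pairs):
    """so_a: pairs whose release and acquire scopes both include agent a."""
    return {(rel, acq) for rel, acq, rel_scope, acq_scope in sync_pairs
            if agent in rel_scope and agent in acq_scope}


def hhb_direct(po, agents, sync_pairs):
    """Union over agents of (po U so_a)+."""
    result = set()
    for a in agents:
        result |= transitive_closure(po | so_for_agent(a, sync_pairs))
    return result


def hhb_indirect(po, agents, sync_pairs):
    """(po U union over agents of so_a)+."""
    all_so = set()
    for a in agents:
        all_so |= so_for_agent(a, sync_pairs)
    return transitive_closure(po | all_so)


SG1 = frozenset({"w1", "w2"})      # sub-group instance 1
SG2 = frozenset({"w3", "w4"})      # sub-group instance 2
WG = SG1 | SG2                     # enclosing work-group
AGENTS = sorted(WG)

# w1: W, R1   w2: A1, R2   w3: A2, R3   w4: A3, L
PO = {("W", "R1"), ("A1", "R2"), ("A2", "R3"), ("A3", "L")}
SYNC = [("R1", "A1", SG1, SG1), ("R2", "A2", WG, WG), ("R3", "A3", SG2, SG2)]

direct = hhb_direct(PO, AGENTS, SYNC)
indirect = hhb_indirect(PO, AGENTS, SYNC)
```

In this example the store `W` and the load `L` are ordered in `indirect` but not in `direct`, because the first and last synchronizations use disjoint sub-group instances and so never appear together in any one agent's so_a.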



Ordinary Load:
    On an ordinary load, the system searches for a valid copy of the location starting in
    the local sub-group write buffer and continuing out to main memory. When a valid copy is
    found, that value is written into any cache that has been searched, the value is returned
    to the agent, and the load terminates the search.

Ordinary Store:
    On an ordinary store, the system inserts the store value into the local write buffer.

Relaxed Atomic Load:
    Same as an ordinary load.

Relaxed Atomic Store:
    Same as an ordinary store.

Relaxed Read-modify-write:
    The read-modify-write operation is performed atomically at cache or memory
    corresponding to C(S).

Seq_cst Atomic Load w/ scope S:
    All write buffers and/or caches in c(S) are flushed/invalidated. After the cache
    operations complete, the load proceeds the same as an ordinary load. Operations later in
    program order cannot execute before the atomic load completes.

Seq_cst Atomic Store w/ scope S:
    All write buffers and/or caches in c(S) are flushed. After the cache operations
    complete, the store proceeds as an ordinary store. Operations earlier in program order
    must be completed before the atomic store can execute.

Seq_cst Atomic Read-modify-write w/ scope S:
    All write buffers and/or caches in c(S) are flushed. After the cache operations
    complete, the read-modify-write completes in the cache or memory corresponding to C(S).
    Operations earlier in program order must be issued and completed before the atomic
    read-modify-write can execute and operations later in program order cannot issue before
    the read-modify-write completes.

Fig. 11. Memory subsystem actions in the HRF-direct-relaxed example implementation.

Heterogeneous-Happens-Before-Indirect-Relaxed ($hhb.ir$): The irreflexive transitive closure
of program order with the union of all scope synchronization orders:

$$\left( po \cup \bigcup_{a \in A} so_a \right)^+$$

Implementations shall ensure there is no cycle in $hhb.ir$.

Fig. 12. Happens-before order in HRF-indirect-relaxed. The full model follows identically to that in
Figure 9 but with hhb.dr replaced with hhb.ir.

5. HRF-RELAXED-OBSERVABLE

Coarse-granularity sharing of data refers to memory regions that are shared between
devices, with or without shared virtual addresses, but whose updates only propagate between
nodes at explicit synchronization points, rather than immediately at the point of performing
a synchronizing operation (an atomic) in the core memory model. To add this support
to the HRF-relaxed models, and hence to fully support the range of current and future
heterogeneous programming models, we must extend the model to include the concept of
observability as seen in Figure 13. The goal is to allow for sets of memory locations that are
migrated by some external entity in and out of given visibility zones. This migration will
be performed at coarse synchronization points. In OpenCL these synchronization points
may be event dependencies between data-parallel kernels, or map and unmap calls that
synchronize with the host thread.

Maximum Dynamic Memory Scope: For a given memory location $L$, $MaxS(L)$ is the maximum
dynamic scope at which a write to location $L$ may become visible, irrespective of the scope of
subsequent synchronization operations. Maximum scope is dependent on an actor being a valid
modifier of a given region as defined outside of the memory model.

Observability ($O(A, B)$): An operation $A$ is observable by an operation $B$ iff

$$A_{MaxS(locationof(A))} \sim_{incl} B_{MaxS(locationof(B))} \land (locationof(A) = locationof(B))$$

Heterogeneous Race: A heterogeneous race occurs between two conflicting (ordinary or
synchronization) memory actions $A$ and $B$ iff $A$ is not observable to $B$ or $A$ and $B$ are
unordered in $hhb.dr$:

$$\neg \big( O(A, B) \land ((A, B) \in hhb.dr \lor (B, A) \in hhb.dr) \big)$$

Value of a Load: In a heterogeneous-race-free program, a load observes the most recent observable
ordinary or synchronization store in $coh$.

Fig. 13. Formalization of HRF-direct-relaxed-observable as a set of changes from HRF-direct-relaxed in
Figure 9. As for HRF-indirect-relaxed, HRF-indirect-relaxed-observable is modified with the happens-before
order from Figure 12.



At any given point in time a given location will be available in a particular set of scope 
instances out to some maximum instance and by some set of actors. Only memory operations 
that are inclusive with that maximal scope instance will observe changes to those locations. 

The default maximum scope instance for traditional memory locations in a fully coherent
system is system scope. At a given point in time a coarse-grained allocation will be visible
to a particular device D, and hence synchronization can only happen at device scope.
Under the terms of the model, actions to that allocation from another device E would not
be able to observe actions from D.
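
The effect of a maximum scope on observability can be sketched as follows. This Python example is one possible reading of the observability definition, under stated assumptions: a scope instance is modeled as the set of agents it contains, the agent names and the `DEVICE_OF` table are hypothetical, and only device and system scopes are modeled.

```python
# Hypothetical agents and the device each one runs on.
DEVICE_OF = {"d_w0": "D", "d_w1": "D", "e_w0": "E"}


def scope_instance(agent, level):
    """Agents contained in the scope instance of `level` around `agent`."""
    if level == "system":
        return frozenset(DEVICE_OF)
    if level == "device":
        return frozenset(a for a, dev in DEVICE_OF.items()
                         if dev == DEVICE_OF[agent])
    raise ValueError(f"unsupported scope level: {level}")


def observable(op_a, op_b, max_scope):
    """O(A, B): same location and inclusive maximum-scope instances."""
    if op_a["location"] != op_b["location"]:
        return False
    level = max_scope[op_a["location"]]
    scope_a = scope_instance(op_a["agent"], level)
    scope_b = scope_instance(op_b["agent"], level)
    return scope_a <= scope_b or scope_b <= scope_a


# A coarse-grained allocation is bounded at device scope.
MAX_SCOPE = {"coarse_buf": "device", "fine_buf": "system"}

store_on_d = {"agent": "d_w0", "location": "coarse_buf"}
load_on_d = {"agent": "d_w1", "location": "coarse_buf"}
load_on_e = {"agent": "e_w0", "location": "coarse_buf"}
```

Under these assumptions `store_on_d` is observable by `load_on_d` but not by `load_on_e`, regardless of any happens-before relationship between the two devices.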

The addition of observability definitions to the model offers three obvious benefits in
describing a model like OpenCL. The first is that it formalizes the rules around
coarse-grained memory, which is the context we saw in the earlier code sequence. The second
is that it offers an opportunity to simplify the rules about sequential consistency, beyond
even the simplification that HRF offers. The third is that we can consider simplifying the
local memory/global memory separation and bringing both under the terms of the same clean
model without explicitly separate orders and join points.

Note in particular that even in the presence of observability, and like scopes in general as
described in [Hower et al. 2014], a heterogeneous-race-free program will produce a consistent
total coherent order. The intuition behind this is that while different memory regions may
strictly order differently, no such reordering is observable to clients. This is little different
from the way that memory updates may reorder between synchronization operations in the
basic SC for DRF models.



5.1. Multiple happens-before orders 

OpenCL and related languages aim to support a wide range of very relaxed memory archi- 
tectures. One consequence of this is that OpenCL has been designed such that the global 
and local address spaces are covered by almost entirely separate happens-before relations. 
These relations may be rejoined carefully using specific fences detailed in the specification. 

We can represent this by instantiating the model for each address space independently. All
atomic operations will order separately for each distinct address space with its own order.
We further assume that the final order for the entire program results from the transitive
closure of the individual orders.

The precise additions to the happens-before order that would cause the closure to bridge
the individual orders might vary from one language to another. In OpenCL's case it is the
specific fences that create this connection.

Per-Address-Space Scoped Synchronization Order ($so_{a,as}$): A release memory action, $R_S$,
appears before an acquire memory action, $A_{S'}$, in $so_{a,as}$ iff $R_S \sim_{incl} A_{S'}$, $R_S$
occurs before $A_{S'}$ in $coh$, $R_S$ and $A_{S'}$ are both in address space $as$, and $S$ and $S'$
both include agent $a$.

Per-Address-Space Heterogeneous-Happens-Before-Indirect-Relaxed ($hhb.ir_{as}$): The
union of the irreflexive transitive closures of all scope synchronization orders in address space
$as$ with program order:

$$\bigcup_{a \in A} \left( (po \cup so_{a,as})^+ \right)$$

Bridging-Synchronization-Order ($bso$): A release memory action, $R_{S,O}$, ordered by a
per-address-space happens-before order $O$ appears before an acquire memory action, $A_{S',O'}$,
ordered by a per-address-space happens-before order $O'$, in $bso$ iff
$R_{S,O} \sim_{incl} A_{S',O'}$ and $R_{S,O}$ occurs before $A_{S',O'}$ in $coh$.

Heterogeneous-Happens-Before-Indirect-Relaxed ($hhb.ir$): The transitive closure of the
union of the happens-before orders for both local and global locations along with the bridging
orderings:

$$\left( \bigcup_{as \in AS} hhb.ir_{as} \cup bso \right)^+$$

Fig. 14. Formalization of the multiple address-space extension to HRF-indirect-relaxed as a set of changes
from Figure 9. The heterogeneous race, values of loads and, optionally, incorporation of Figure 13 re-apply
identically to hhb.ir.

Figure 14 shows this formalization as applied to the OpenCL case of having two separate 
orders, but the concept would generalize to any number of distinct happens-before orders 
were that to be required. 

6. DETAILS OF THE OPENCL 2.0 MEMORY MODEL 

The OpenCL 2.0 memory model maintains the features of the OpenCL 1.x model, including
the execution hierarchy and basic synchronization discussed in Section 2.2, and extends it to
include support for shared virtual memory (SVM) in a global address space. SVM allocations
are distinguished as being either fine-grain or coarse-grain, which affects the observability
of memory as well as the types of synchronization that are supported.

When using what OpenCL calls fine-grain SVM with platform atomics and restricting 
to sequentially-consistent atomic operations, the system appears similar to the basic model 
assumed by the original HRF work. A pointer to global memory is valid on any actor, 
and the host CPU thread does not need to explicitly manage data allocations, transfers, 
or mapping in device memory that would otherwise be required. In addition, the model 
provides non-SC atomic operations similar to those found in C/C++, but augmented with 
the ability to control scope visibility. With these features, it is possible to create programs 
that take advantage of the hardware support for system-wide shared memory in recent 
SoCs. 

Like C++ and HRF, OpenCL 2.0 fundamentally follows a race-free memory model, such
that only race-free programs have well-defined behavior. OpenCL race-free executions are
sequentially consistent by default (that is, when using the default atomic ordering and scope),
but can be relaxed to produce well-defined but non-sequentially-consistent executions with
explicitly relaxed atomic operations (see Section 6).

5 We note here that the OpenCL 2.0 specification adds FIFO data structures called pipes and includes
support for image data structures with some extensions beyond those available in OpenCL 1.2. The restrictions
OpenCL 2.0 applies to both of these leave them outside of the core memory model and therefore out of
scope of this discussion.

In the remainder of this section, we provide the nuanced details of the OpenCL 2.0 
memory model and then show how to describe it in terms of HRF. 



6.1. OpenCL Address Spaces 

Figure 15 shows OpenCL's address spaces, in which each colored box represents a different
address space. A work-item has access to a private memory visible only to itself, a local
memory that is shared between work-items in the same work-group, and finally a global
memory shared between all concurrently executing work-items as well as the host. The
address spaces are assumed to be disjoint. In OpenCL 1.x, these address
spaces were explicit. In OpenCL 2.0, a programmer can map the above address spaces into
a single generic virtual memory map, though that mapping does not change the properties
of the memory model. For example, virtual addresses corresponding to private memory
locations correspond to different physical locations for each work-item.

In general, OpenCL's memory model treats address spaces separately and an operation 
on one address space does not affect the others. The local and global address spaces are 
the only address spaces that are both shared between work-items and writable, and are 
therefore governed by the properties of the memory model. Private memory is never visible 
outside of a work-item, and so its ordering properties are handled trivially. The memory 
orderings on local and global are independent and atomic operations applied to one do not 
affect memory orderings on the other. 

It is, however, possible to join the global and local address space memory orders such 
that two work-items can communicate between local and global memory. For example, a 
work-item could write data into global memory and then synchronize via a local memory 
flag with another work-item in its own work-group. When doing so, the work-item would 
need to issue a special fence operation that joins the two address spaces. 

More specifically, OpenCL uses the | operator in a memory fence to
combine local and global memory. When two memory fences each specify
CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE, the individual
happens-before relations from local and global memory are merged for the two issuing
work-items. This synchronization corresponds to the bridging operations that form bso in
Figure 14. No other synchronization operations appear in bso for the OpenCL model.

6.2. Forms of Shared Virtual Memory

Shared virtual memory allocations in OpenCL 2.0 can be categorized in three ways:

(1) Coarse-Grained buffer 

(2) Fine-Grained buffer 

(3) Fine-Grained system 

Coarse-grained SVM buffers (see footnote 6) only guarantee consistency between different
agents at coarse-grained synchronization points (map and unmap operations or inter-command
dependencies) and at the granularity of the entire memory allocation. An implementation is
only required to present the same virtual address space for a coarse-grained SVM allocation
to the subset of devices using that buffer.

At coarse-grained synchronization points, an implementation may copy the data to and 
from physical locations that are visible only to a specific device. For example, an implemen- 
tation might copy data in/out of a GPU's physically separate and non-coherent DRAM. As 
we will see, this property plays an important role in the definition of sequentially consistent 
atomics and the perceived single total order that is expected to exist for such operations. 

Fine-grained buffers can be supported with and without platform atomics. Without platform
atomics, visibility between different OpenCL devices is, as for coarse-grained buffer
allocations, only guaranteed at explicit synchronization points such as kernel begin and
end. Unlike coarse-grained buffers, visibility is defined at a byte granularity and does not
require map and unmap operations to ensure visibility on the host.

Fine-grained system SVM extends fine-grained support to all host memory. This extends
the set of locations visible to OpenCL 2.0's memory model but has no effect otherwise.

When platform atomics are enabled, memory consistency for both fine-grained SVM 
modes may be achieved by atomic operations directly without the need to wait for coarse 
synchronization points. 

Visibility for fine-grained memory conceptually maps to the following scopes: 

Fine-Grained with platform atomics.
    memory_scope_all_svm_devices, or platform-wide visibility. Allows for concurrent access
    from any agent that can participate in the OpenCL shared virtual address space, with
    ordering and visibility arising directly from atomic operations.

Fine-Grained without platform atomics.
    memory_scope_device, or device-only visibility. Concurrent access by different agents to
    the same byte is not permitted (formally, such access is defined as a data race) and
    coherency is guaranteed only at well-defined synchronization points such as kernel begin
    and end.

Coarse-grained memory affects the observability of memory locations. For example, in
Figure 16, if the storing actor and the loading actor run on the same device, then
happens-before guarantees that both T and U will be 1. If, however, they run on different
devices, it is possible that T will remain 0 while U is 1. Although the synchronization
through Z guarantees a happens-before relationship, coarse-grained memory properties do not
guarantee visibility beyond the allocation's maximum scope (in this case, Device).

Both of the more coarse forms of shared virtual memory in OpenCL can be covered by
bounding observability. The maximum dynamic memory scope for a coarse-grained allocation,
or a fine-grained allocation without platform atomics, is device scope. Therefore
happens-before actions on a different device will not order operations relative to
happens-before actions on the current device. Any store performed on a coarse-grained
buffer by one device will not be in the observable coherent order for the location on
another device, as described in Figure 13.

6 We include non-SVM allocations in this category because they behave the same way according to the
memory model.

Device D1, Work-group X, Work-item X1:

    101 C = 1;
    102 Y = 1;
    103 Z.store(1, mo_sc, ms_svm);

Device D2, Work-group Y, Work-item Y1:

    201 if (Z.load(mo_sc, ms_svm)) {
    202     T = C;
    203     U = Y;
    204 }

Fig. 16. Coarse-grained memory ordering in OpenCL. If C is in a coarse-grained buffer, Y and Z are in
fine-grained buffers, and D1 ≠ D2, then T will be 0 and U will be 1.
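
The effect of bounded observability on Figure 16 can be sketched in a few lines of Python. This is only an illustrative model, not part of the OpenCL specification: the names `Location`, `observable`, and `run_figure_16` are hypothetical, the fine-grained buffer holding Y is assumed to have platform atomics, and the sketch reports the outcome stated in the figure caption for the cross-device case (where the text only says that T may remain 0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    name: str
    # Assumption: True means a fine-grained allocation with platform atomics.
    fine_grained_atomics: bool


def observable(loc: Location, store_device: str, load_device: str) -> bool:
    """A store is observable on another device only if the allocation's maximum
    scope spans devices; otherwise observability is bounded by the device."""
    return loc.fine_grained_atomics or store_device == load_device


def run_figure_16(d1: str, d2: str) -> dict:
    """Values read by work-item Y1 once its load of Z has returned 1."""
    c = Location("C", fine_grained_atomics=False)  # coarse-grained buffer
    y = Location("Y", fine_grained_atomics=True)   # fine-grained buffer
    # Both locations start at 0; X1 stores 1 to each before releasing via Z.
    t = 1 if observable(c, d1, d2) else 0
    u = 1 if observable(y, d1, d2) else 0
    return {"T": t, "U": u}
```

Calling `run_figure_16("D1", "D1")` models the same-device case, and `run_figure_16("D1", "D2")` models the case in which the happens-before edge through Z exists but the store to C is outside the loading device's observable coherent order.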

The difference between fine-grained allocations without platform atomics and coarse-grained
allocations is in the definition of a race. For a coarse-grained buffer, the required
map, unmap, and event-dependency operations add an entry to the coherent update order
for every memory location in the allocation, thus conflicting with updates performed by any
other device. A fine-grained allocation will only update side effects caused by the executing
kernel and thus conflicts only at update granularity.

6.3. Scopes 

OpenCL 2.0 has a different definition of scope inclusion from the basic one in Definition 3.1.
This arises for two reasons. First, the different types of shared memory complicate the notion 
of dynamic scope equivalence. In particular, the dynamic SVM scope is defined differently 
depending on whether or not operations use fine-grained buffers. If two operations both use 
fine-grained buffers, then the SVM scope includes all actors in the system. Otherwise, SVM 
scope is limited to actors on the same device. 

Second, OpenCL only supports synchronization between atomics with identical dynamic
scope, which is similar to the rules for the original HRF-direct model.

With these two changes, we can define the scope inclusion property for OpenCL. Let us
define fga(O) to be true if the operation O affects a location in a fine-grained memory
allocation with platform atomics. Then:

Definition 6.1. OpenCL 2.0 dynamic scope equivalence. Two scoped synchronization
operations, $O_S$ and $O'_{S'}$, with static scopes $S$ and $S'$ that execute on subgroups $SG$
and $SG'$, work-groups $WG$ and $WG'$, and devices $D$ and $D'$ have equivalent dynamic scopes
iff:

\[
\begin{aligned}
       & (S = S' = \mathit{ms\_sg} \land SG = SG') \\
\lor\; & (S = S' = \mathit{ms\_wg} \land WG = WG') \\
\lor\; & (S = S' = \mathit{ms\_dev} \land D = D') \\
\lor\; & (S = S' = \mathit{ms\_svm} \land \mathit{fga}(O_S) \land \mathit{fga}(O'_{S'})) \\
\lor\; & (S = S' = \mathit{ms\_svm} \land (\lnot \mathit{fga}(O_S) \lor \lnot \mathit{fga}(O'_{S'})) \land D = D')
\end{aligned}
\]



Definition 6.2. OpenCL 2.0 scope inclusion (clincl). Two scoped synchronization
operations, $O_S$ and $O'_{S'}$, are OpenCL 2.0-inclusive, written $O_S \xrightarrow{clincl} O'_{S'}$, iff the
dynamic scope of $O_S$ is equivalent to the dynamic scope of $O'_{S'}$.
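
As a rough executable reading of Definitions 6.1 and 6.2, the Python sketch below encodes the five disjuncts directly. The `SyncOp` record and the function name `clincl` are hypothetical, and the sketch assumes that subgroup and work-group identifiers are globally unique, so that comparing identifiers is enough to decide whether two operations execute in the same subgroup or work-group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncOp:
    scope: str        # static scope: "ms_sg", "ms_wg", "ms_dev", or "ms_svm"
    subgroup: str     # assumed globally unique subgroup identifier
    work_group: str   # assumed globally unique work-group identifier
    device: str
    fga: bool         # touches a fine-grained allocation with platform atomics


def clincl(a: SyncOp, b: SyncOp) -> bool:
    """True iff a and b have equivalent dynamic scopes (Definition 6.1)."""
    if a.scope != b.scope:
        return False  # every disjunct requires S = S'
    if a.scope == "ms_sg":
        return a.subgroup == b.subgroup
    if a.scope == "ms_wg":
        return a.work_group == b.work_group
    if a.scope == "ms_dev":
        return a.device == b.device
    if a.scope == "ms_svm":
        both_fga = a.fga and b.fga
        # Without platform atomics on both sides, SVM scope shrinks to one device.
        return both_fga or ((not a.fga or not b.fga) and a.device == b.device)
    return False
```

Under this reading, two `ms_svm` operations on different devices are inclusive only when both touch fine-grained allocations with platform atomics, which matches the two reasons given above for departing from Definition 3.1.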

6.4. HRF-OpenCL 

We can now describe the OpenCL 2.0 memory model in terms of HRF concepts. 

OpenCL supports scope transitivity, so we base the model on HRF-indirect-relaxed.
We need to include the OpenCL notion of scope inclusion from Definition 6.2, observability
for coarse-grained allocations, and local and global address spaces. We also include
OpenCL's restrictive definition of sequentially consistent atomics. In the end, we arrive at
the model in Figure 17, which picks up at Section IV from Figure 9.



Computer Sciences Technical Report 2014-01. 






IV. Consistent Apparent Orders And Load Values in a Candidate Execution

Program Order ($\xrightarrow{po}$): Operations $O$ and $O'$ are in program order, written
$O \xrightarrow{po} O'$, iff both are from the same agent and $O$ comes before $O'$ in the
execution control flow.

Sequentially Consistent Atomic Order ($\xrightarrow{sc}$): There is an apparent total order,
$\xrightarrow{sc}$, of all the memory_order_seq_cst operations iff all atomics use scope ms_svm
and are to a fine-grained allocation with platform atomics, or all atomics use scope ms_dev
and none are to a fine-grained allocation with platform atomics. $\xrightarrow{sc}$ must be
consistent with $\xrightarrow{po}$.

Coherent Order ($\xrightarrow{coh}$): There is an apparent total order, $\xrightarrow{coh}$, of all
accesses by all actors to any single location. $\xrightarrow{coh}$ must be consistent with
$\xrightarrow{sc}$ and $\xrightarrow{po}$.

Local Scoped Synchronization Order ($\xrightarrow{so_{l,a}}$): Given a release memory action,
$Rel_S$, and an acquire memory action, $Acq_{S'}$, $Rel_S \xrightarrow{so_{l,a}} Acq_{S'}$ iff
$Rel_S \xrightarrow{clincl} Acq_{S'}$, $Rel_S \xrightarrow{coh} Acq_{S'}$, $S$ and $S'$ both include
agent $a$, and $Rel_S$ and $Acq_{S'}$ both touch locations in local memory.

Global Scoped Synchronization Order ($\xrightarrow{so_{g,a}}$): Given a release memory action,
$Rel_S$, and an acquire memory action, $Acq_{S'}$, $Rel_S \xrightarrow{so_{g,a}} Acq_{S'}$ iff
$Rel_S \xrightarrow{clincl} Acq_{S'}$, $Rel_S \xrightarrow{coh} Acq_{S'}$, $S$ and $S'$ both include
agent $a$, and $Rel_S$ and $Acq_{S'}$ both touch locations in global memory.

Local-happens-before ($\xrightarrow{lhb}$): The irreflexive transitive closure of the union of
program order and local scoped synchronization order:

\[ \xrightarrow{lhb} \;=\; \Big( \xrightarrow{po} \cup \bigcup_{a \in A} \xrightarrow{so_{l,a}} \Big)^{+} \]

Global-happens-before ($\xrightarrow{ghb}$): The irreflexive transitive closure of the union of
program order and global scoped synchronization order:

\[ \xrightarrow{ghb} \;=\; \Big( \xrightarrow{po} \cup \bigcup_{a \in A} \xrightarrow{so_{g,a}} \Big)^{+} \]

Bridging Synchronization Order ($\xrightarrow{bso}$): A release fence, $R_{S,O}$, ordered by either
local or global happens-before order $O$, appears before an acquire memory action, $A_{S',O'}$,
ordered by either local or global happens-before order $O'$, in $\xrightarrow{bso}$ iff
$R_{S,O} \xrightarrow{clincl} A_{S',O'}$ and $R_{S,O}$ appears before $A_{S',O'}$ in $\xrightarrow{coh}$.

OpenCL-Happens-Before ($\xrightarrow{clhb}$):

\[ \xrightarrow{clhb} \;=\; \Big( \xrightarrow{lhb} \cup \xrightarrow{ghb} \cup \xrightarrow{bso} \Big)^{+} \]

$\xrightarrow{clhb}$ cannot contain a cycle and is consistent with $\xrightarrow{coh}$.

Observability ($O(A, B)$): An operation to global memory $A[\ell]$ executed on device $D$ is
observable by another operation to global memory $B[\ell']$ executed on device $D'$ iff
$\ell = \ell'$ and $(\mathit{fga}(A) \land \mathit{fga}(B)) \lor (D = D')$.

An operation to local memory $A[\ell]$ is observable to an operation $B[\ell']$ iff $\ell = \ell'$ and
both are executed in the same work-group.

Value of a Load: A load $L[\ell]$ observes the value produced by the most recent observable store
$S[\ell]$ in $\xrightarrow{coh}$.

V. Races

Heterogeneous Race: A candidate execution contains a heterogeneous race iff two conflicting
(ordinary or atomic) actions $A$ and $B$ are not observable or are unordered in $\xrightarrow{clhb}$:

\[ \lnot O(A, B) \;\lor\; \lnot \big( A \xrightarrow{clhb} B \,\lor\, B \xrightarrow{clhb} A \big) \]

Fig. 17. HRF-OpenCL. We re-use Sections I-III from Figure 9, but substitute clincl for incl.
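
To make the composition of orders in Figure 17 concrete, the following Python sketch builds OpenCL-happens-before as the transitive closure of the union of three given relations and then applies the race condition. It is a toy model under stated assumptions: events are opaque labels, the relations are supplied as sets of (before, after) pairs rather than derived from a program, and the function names are hypothetical.

```python
from itertools import product


def transitive_closure(edges):
    """Smallest transitive relation containing the given (before, after) pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def clhb(lhb, ghb, bso):
    """OpenCL-happens-before: closure of local hb, global hb, and bridging order."""
    return transitive_closure(set(lhb) | set(ghb) | set(bso))


def is_acyclic(order):
    """A transitively closed relation is acyclic iff it has no reflexive pair."""
    return all(a != b for a, b in order)


def has_race(a, b, order, observable):
    """Conflicting actions race if they are not observable or are unordered."""
    ordered = (a, b) in order or (b, a) in order
    return (not observable) or (not ordered)
```

For example, with `lhb = {("a", "b")}`, `ghb = {("c", "d")}`, and `bso = {("b", "c")}`, the bridging edge is what places `("a", "d")` in the combined order; dropping it leaves `a` and `d` unordered, so a conflicting pair would be reported as a race.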
7. SIMPLIFICATIONS TO THE OPENCL MODEL USING HRF-INDIRECT-RELAXED

7.1. Simplifying the local/global ordering separation 

Unfortunately, maintaining two separate orderings for local and global memory complicates
the OpenCL memory model. For most code we can view local and global memory entirely
separately, such that we instantiate the entire HRF-indirect-relaxed model twice, once for
local accesses and once for global accesses. We saw this in Figure 14.

In addition to the combining of orders, local memory is only observable at work-group
scope, and thus updates are never visible to other work-groups. It is a quirk of the OpenCL
model that sequential consistency may not be applied simultaneously to allocations with
platform atomics and allocations without platform atomics, but can be applied
simultaneously to global allocations without platform atomics and local allocations, even
though similar observability restrictions apply in both cases.

By treating local memory as a range of locations L such that MaxS(L) is
memory_scope_work_group, local memory operations need not become visible to other
work-groups, even when added into a single happens-before ordering. This would allow the
OpenCL definition to be simplified to a single happens-before ordering in the memory model
and would trivially allow a single sequentially consistent total order S to apply to local
memory as well as global memory, in all cases as viewed by a valid observer.
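
One way to picture this proposal is a single observability check parameterized by each location's maximum scope, so that local memory needs no separate ordering. The Python sketch below is hypothetical: the `Agent` record, the scope labels, and `within_scope` are illustrative names, and work-group identifiers are assumed to be unique only within a device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    work_group: str
    device: str


def within_scope(max_scope: str, a: Agent, b: Agent) -> bool:
    """True if agents a and b fall inside the same instance of max_scope."""
    if max_scope == "work_group":
        return a.device == b.device and a.work_group == b.work_group
    if max_scope == "device":
        return a.device == b.device
    if max_scope == "all_svm_devices":
        return True
    raise ValueError(f"unknown scope: {max_scope}")


# Hypothetical MaxS table: local memory is bounded by the work-group, in the
# same way that coarse-grained global memory is bounded by the device.
MAX_SCOPE = {"local": "work_group", "coarse_global": "device",
             "fine_global_atomics": "all_svm_devices"}
```

With such a table, a store to a local location by an agent in one work-group is simply never observable to an agent in another, which is the property the separate local ordering currently provides.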

There may be other valid reasons for maintaining a separate ordering for local memory, 
for example, separate hardware scratchpad memories might use different operations with 
separate ordering guarantees to the global memory cache hierarchy. However, the same 
might apply to the use of buffers that do and that do not support platform atomics (for 
example, the need to use PCI-express atomics rather than local cache atomics) in the same 
application, and no separate order is currently maintained for those cases. In particular, 
the generic address space applied to local memory maintains a separate ordering for local 
addresses from that of global addresses. Two seemingly identical pointers both passed to 
a single function might have accesses ordered entirely separately when used. Applying a 
single happens-before order to both address spaces would remove this concern. 

This simplification would improve the usability of the overall model for developers. 

7.2. Generalizing sequential consistency 

The HRF models already demonstrate that a heterogeneous race can be present across
scopes even in the presence of a sequentially consistent (SC) ordering. The principle behind
this is the same as that behind the DRF models in general: an ordering only matters for
operations that are well-defined, and everything else can be relaxed. As a result, while
OpenCL limits SC ordering to particular scopes, HRF shows that this ordering could be
carried through all scopes and still be well-defined. In effect, this is a clean extension
of the model that OpenCL already discusses.

The current OpenCL 2.0 specification limits SC to either one of two situations. For all 
buffers b and sequentially-consistent memory operations o in a given execution: 

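\[
\forall (b, o):\;
\big( \mathit{scopeof}(o) = \texttt{memory\_scope\_all\_svm\_devices} \land \mathit{fga}(b) \big)
\;\lor\;
\big( \mathit{scopeof}(o) = \texttt{memory\_scope\_device} \land \lnot \mathit{fga}(b) \big)
\]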
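
The restriction can be read as a predicate over an execution. The Python sketch below follows the "one of two situations" reading, in which the whole execution must fall into a single situation, as in the Sequentially Consistent Atomic Order entry of Figure 17. The function name and the `(scope, fga)` encoding of each sequentially consistent operation are assumptions made for illustration.

```python
def sc_total_order_guaranteed(ops) -> bool:
    """ops: iterable of (scope, fga) pairs, one per seq_cst atomic operation.

    scope is "all_svm_devices" or "device"; fga is True when the operation
    targets a fine-grained allocation with platform atomics.
    """
    ops = list(ops)
    all_svm_on_fga = all(scope == "all_svm_devices" and fga for scope, fga in ops)
    all_dev_off_fga = all(scope == "device" and not fga for scope, fga in ops)
    return all_svm_on_fga or all_dev_off_fga
```

A kernel that uses device scope on a buffer the host happened to allocate as fine-grained with platform atomics fails this check, which is the composability problem discussed next.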

This results in a weaker ordering in any program that combines a fine-grained-with- 
atomics allocation with device-scope, or all-svm-devices scope with coarse-grained or non- 
atomic fine-grained allocations. This is a clear composability problem because the scope 
is controlled by kernel code and the allocation type is controlled by host code, with no 
guarantee that the two are written by the same developer. 

HRF-indirect already shows how, in the absence of these coarser allocations,
sequentially consistent properties can be extended across scopes, removing the need for the
restriction to a single scope. Observability allows us to assume a single SC ordering on
all buffer types with well-defined semantics. Operations to coarse allocations will be
sequentially consistent to the operating device. Other devices are not valid observers, and
hence the order in which they see the operations is undefined.

8. CONCLUSION 

In this paper we have described how to extend the class of Heterogeneous-race-free memory 
consistency models to incorporate four complex features of industrial memory models. This 
includes support for non-sequentially-consistent operations, a property called scope inclu- 
sion, limited observability of memory locations, and multiple address spaces. By building 
from the more basic HRF-direct and/or HRF-indirect models, we have shown how users 
of industrial models can restrict their programs to comply with a pure SC for HRF model 
and ignore the hard-to-understand complications. We have shown this explicitly with the 
OpenCL 2.0 model. 

Using our formalization, we have shown how OpenCL could be extended to support a 
simpler notion of local memory and a wider range of sequentially consistent executions. 
A. HRF-DIRECT-RELAXED MODEL PROOFS

In this section we prove that HRF-direct-relaxed mandates a sequentially consistent
execution for all heterogeneous-race-free programs.

Lemma A.1. In a heterogeneous-race-free execution, the value observed by an ordinary
load, $L[\ell]$, is the value produced by the unique most recent store, $S[\ell]$, to the same
location $\ell$ in $\xrightarrow{hhb.dr}$ with respect to $L[\ell]$.

Proof. To restate using our notation, we are trying to show that given an ordinary load
$L[\ell]$ and the store $S[\ell]$ that $L[\ell]$ observes:

\[ \big( S[\ell] \xrightarrow{hhb.dr} L[\ell] \big) \land \big( \nexists S'[\ell] : S[\ell] \xrightarrow{hhb.dr} S'[\ell] \xrightarrow{hhb.dr} L[\ell] \big) \]

We first prove the left-hand side of the conjunction.

We know by the definition of a load value that $S[\ell] \xrightarrow{coh} L[\ell]$. Because we are
considering a heterogeneous-race-free execution and because $L[\ell]$ is ordinary, we also know
that $(L[\ell] \xrightarrow{hhb.dr} S[\ell]) \lor (S[\ell] \xrightarrow{hhb.dr} L[\ell])$. Because
$\xrightarrow{hhb.dr}$ is by definition consistent with $\xrightarrow{coh}$, we know that the
left-hand side of this disjunction cannot be true, and therefore
$S[\ell] \xrightarrow{hhb.dr} L[\ell]$ as expected.

Now we must show that there is no other store $S'[\ell]$ that is ordered after $S[\ell]$ but before
$L[\ell]$ in $\xrightarrow{hhb.dr}$. Let us assume that $S'[\ell]$ exists. Because $S'[\ell]$ is to the
same location $\ell$, we know that $S'[\ell] \xrightarrow{coh} L[\ell] \lor L[\ell] \xrightarrow{coh} S'[\ell]$.
If $S'[\ell] \xrightarrow{coh} L[\ell]$, we know that $S'[\ell] \xrightarrow{coh} S[\ell] \xrightarrow{coh} L[\ell]$
because $L[\ell]$ observed the value of $S[\ell]$. However, because $\xrightarrow{coh}$ is consistent
with $\xrightarrow{hhb.dr}$, we know that $S'[\ell] \xrightarrow{coh} S[\ell]$ and
$S[\ell] \xrightarrow{hhb.dr} S'[\ell]$ (our starting assumption) cannot both be true.
Therefore, $L[\ell] \xrightarrow{coh} S'[\ell]$ must be true, but that again contradicts our initial
assumption that $S[\ell] \xrightarrow{hhb.dr} S'[\ell] \xrightarrow{hhb.dr} L[\ell]$. Therefore, $S'[\ell]$
cannot exist, as expected. □

Lemma A.2. In a heterogeneous-race-free execution, the store $S[\ell]$ that an atomic load,
$L_S[\ell]$, observes is either the most recent store to the same location in
$\xrightarrow{hhb.dr}$ (same as an ordinary load), or is a store that is unordered with respect
to $L_S[\ell]$ in $\xrightarrow{hhb.dr}$.

Proof. There are two cases to consider. Either $S[\ell]$ is the most recent store in
$\xrightarrow{hhb.dr}$, or it is not. If it is the most recent store, then the same analysis as
in the proof of Lemma A.1 applies and we can prove that $L$ observes the value produced by $S$.

If $S[\ell]$ is not the most recent store in $\xrightarrow{hhb.dr}$, then we can show that $S[\ell]$
does not occur after $L_S[\ell]$ in $\xrightarrow{hhb.dr}$ nor before the most recent store
$S'[\ell]$ that occurs before the load in $\xrightarrow{hhb.dr}$. Since $S[\ell]$ is not the most
recent store in $\xrightarrow{hhb.dr}$, does not occur before the most recent store in
$\xrightarrow{hhb.dr}$, and does not occur after $L_S[\ell]$ in $\xrightarrow{hhb.dr}$, we can
conclude that $S[\ell]$ is not ordered with respect to $L_S[\ell]$ in $\xrightarrow{hhb.dr}$.

We show by contradiction that $S[\ell]$ does not occur before the most recent store before
$L_S[\ell]$ in $\xrightarrow{hhb.dr}$. If there is a store $S'[\ell]$ such that
$S[\ell] \xrightarrow{hhb.dr} S'[\ell] \xrightarrow{hhb.dr} L_S[\ell]$, then $L_S[\ell]$ would have to
observe the value of $S'[\ell]$ because $\xrightarrow{coh}$ must be consistent with
$\xrightarrow{hhb.dr}$. Thus $S'[\ell]$ yields a contradiction and cannot exist.

We similarly show by contradiction that $S[\ell]$ does not occur after $L_S[\ell]$ in
$\xrightarrow{hhb.dr}$. If $L_S[\ell] \xrightarrow{hhb.dr} S[\ell]$, then $S[\ell]$ yields a
contradiction because $S[\ell] \xrightarrow{coh} L_S[\ell]$ must also be true (it provides a
value), and by definition $\xrightarrow{coh}$ must be consistent with $\xrightarrow{hhb.dr}$.
Thus, if $L_S[\ell]$ observes the value of $S[\ell]$ in this case, then $S[\ell]$ cannot be ordered
with respect to $L_S[\ell]$ in $\xrightarrow{hhb.dr}$. □
Lemma A.3. In a heterogeneous-race-free execution that only uses sequentially consistent
atomics, the value observed by an atomic load $L_S[\ell]$ is the value produced by the most
recent store $S[\ell]$ to the same location in $\xrightarrow{hhb.dr}$.

Proof. If $L_S[\ell]$ is atomic, then we know from Lemma A.2 that the store it observes is
either the most recent store $S[\ell]$ in $\xrightarrow{hhb.dr}$ or some other store $S'[\ell]$ that
is unordered with respect to $L_S[\ell]$ in $\xrightarrow{hhb.dr}$. We will show that the latter
case cannot be true.

We know that $S'[\ell]$ cannot be atomic because if it were, we could conclude
$S'[\ell] \xrightarrow{sc} L_S[\ell]$ (both are mo_sc atomics, $\xrightarrow{sc}$ is consistent with
$\xrightarrow{coh}$, and $L_S[\ell]$ observed the value of $S'[\ell]$). Then we could also conclude
$S'[\ell] \xrightarrow{hhb.dr} L_S[\ell]$ by the definition of happens-before, which is a
contradiction. Thus, $S'[\ell]$, if it existed, cannot be an atomic.

We also know that $S'[\ell]$ cannot be ordinary, because if it were, then $S'[\ell]$ and $L_S[\ell]$
would form a heterogeneous race (unordered in happens-before).

Thus, $S'[\ell]$ cannot exist, and therefore $L_S[\ell]$ must observe $S[\ell]$, the most recent store
to the same location in $\xrightarrow{hhb.dr}$. □

Theorem A.4. All executions of a heterogeneous-race-free program that only use mo_sc
atomics will be sequentially consistent on an HRF-direct-relaxed implementation.

Proof. We know from Lemma A.1 and Lemma A.3 that the value of any load $L[\ell]$ in
a heterogeneous-race-free execution using only mo_sc atomics comes from the unique store
$S[\ell]$ to the same location that is most recent w.r.t. $L[\ell]$ in $\xrightarrow{hhb.dr}$.
$\xrightarrow{hhb.dr}$ cannot contain a cycle, and thus we can construct an apparent total
order of execution by ordering any operations left unordered by $\xrightarrow{hhb.dr}$ by
program order and agent id. Because $\xrightarrow{hhb.dr}$ is consistent with
$\xrightarrow{po}$, the total order of execution is sequentially consistent. □
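
The construction used in the proof of Theorem A.4 amounts to a topological sort of the acyclic happens-before relation with a deterministic tie-break. A minimal Python sketch follows, assuming each operation carries an `(agent_id, program_index)` key; the function name, the key encoding, and the choice to compare agent id before program index are illustrative assumptions rather than part of the model.

```python
import heapq


def total_order(ops, hhb):
    """Extend an acyclic happens-before relation to a total order.

    ops: dict mapping each operation to its (agent_id, program_index) key.
    hhb: set of (before, after) pairs over the operations in ops.
    Operations left unordered by hhb are ordered by their key.
    """
    succs = {op: [] for op in ops}
    indeg = {op: 0 for op in ops}
    for before, after in hhb:
        succs[before].append(after)
        indeg[after] += 1

    # Kahn's algorithm with a heap so that ties are broken by key.
    ready = [(ops[op], op) for op in ops if indeg[op] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, op = heapq.heappop(ready)
        order.append(op)
        for nxt in succs[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                heapq.heappush(ready, (ops[nxt], nxt))

    if len(order) != len(ops):
        raise ValueError("happens-before relation contains a cycle")
    return order
```

The cycle check mirrors the requirement that the happens-before relation be acyclic: if it were not, no total order consistent with it could exist.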

B. IMPLEMENTATION PROOFS 

To prove that the example system in Section 4.5 of the main document is a valid implemen- 
tation of HRF-direct-relaxed, we show that the implementation produces the following 
required apparent orders during the execution of a heterogeneous-race-free program: 

- Program Order. We assume valid agent execution engine implementations that issue
  loads and stores to the memory system in valid program order.

- Coherent Order. We can construct a total order of all accesses to the same location
  that is consistent with program order.

- Value of a Load. We can show that a load observes the value most recently stored in
  the coherent order.

- Sequentially Consistent Synchronization Order. We must prove that two properties
  hold. First, that all agents will see seq_cst atomics in the same order. Second, that the
  order observed by all agents is consistent with program order.

- Heterogeneous-happens-before. We must show that hhb.dr exists, that it is consistent
  with coh, and that there is no cycle in hhb.dr.

First, we will prove that all of the above orders exist and then will show that the imple- 
mentation follows the HRF-direct-relaxed model. 

Lemma B.1. The example system produces an apparent total order of all accesses to
the same location that is consistent with program order.

Proof. All loads and stores are sent to the memory system in program order (by
construction). Because a load searches the memory hierarchy on the same path that a store
from the same thread will update the memory hierarchy, and because the memory system
will not service requests to the same location out of program order, a load cannot reorder
with a store from the same agent to the same location. Thus, program order is consistent
with the order of completion with respect to a single location.

We can establish a valid global coherent order for a given location as follows. Assume every
copy of a location (that exists in a cache or main memory) is tagged with an imaginary
version number and a list of the loads and stores that have accessed the location since
it existed in the cache or memory. When a location is brought into a cache, its version
number is set to 1 and the access list is initialized to the empty set. When a store or load
completes, that operation is tagged with the current version number of the location and then
the version number is incremented by one. When a location is evicted from a cache, the
accesses are appended to the access list of the same location in the next level of the memory
hierarchy and their version numbers are all incremented by the current version number in the
next level. The next level's version number is then incremented by the previous level's old
version number. At the end of execution, when all blocks have been evicted from caches,
the access list at DRAM will contain a total coherent order for all accesses to a location.
We have previously shown that coherent order is consistent with program order. □
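
The bookkeeping in the proof of Lemma B.1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class name `Copy` and its methods are hypothetical, only one location is tracked, and caches are evicted one at a time directly into the next level, so concurrent copies in sibling caches are not modeled.

```python
class Copy:
    """One copy of a location, held in a cache or in main memory."""

    def __init__(self):
        self.version = 1      # imaginary version number, starts at 1
        self.accesses = []    # list of (operation, version tag) pairs

    def complete(self, op):
        """Tag a completed load or store with the current version, then bump it."""
        self.accesses.append((op, self.version))
        self.version += 1

    def evict_into(self, lower):
        """Append this copy's accesses to the next level of the hierarchy."""
        offset = lower.version
        # Shift every tag by the next level's current version number.
        lower.accesses.extend((op, tag + offset) for op, tag in self.accesses)
        # Advance the next level by this copy's old version number.
        lower.version += self.version
```

For instance, completing a store and a load in one cached copy, evicting it into a `Copy` standing for DRAM, and then doing the same for a later cached copy leaves DRAM with an access list whose tags are strictly increasing, which is the total order the proof relies on.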

Lemma B.2. In an execution on the example system, all agents observe an apparent
total order of all mo_sc atomics that is consistent with program order.

Proof. For simplicity of exposition, we first consider programs that only contain
sequentially consistent atomics. Later we show how to remove that constraint with the same
end result. We also prove the total order of sequentially consistent atomics in steps by first
considering programs that only use system-scope, then adding programs that use system-scope
and work-item-scope, then adding work-group scope, and finally programs that use
all scopes.

First let us consider a program that uses a single, global system scope for all its operations.
In the execution of that program, all caches in c(System) must be flushed/invalidated before
a load, and thus all loads get their value from main memory. Because DRAM services seq_cst
requests in order, if a load $L_1 <_{po} L_2$, then $L_1$ will read from main memory before $L_2$. The
flushes caused by a load will flush the result of any prior store to main memory, thus we can
be sure that if $S <_{po} L$, then $S$ will appear before $L$ at DRAM in the same order. Similarly,
since sequentially consistent stores flush all caches in c(S) before executing, we can be sure
that if a store $S_1 <_{po} S_2$ in program order, $S_1$ will update main memory before $S_2$. Finally,
because operations later in program order cannot start before a seq_cst load completes,
we can be sure that if there is a load $L <_{po} S$, the load will complete at main memory
before the store is written to main memory. Since DRAM services sequentially consistent
operations in order, there is a total order of all sequentially consistent operations performed
w.r.t. system scope that respects program order.

Now let us consider a program that uses system-scope atomics mixed with work-item-scope
atomics. A work-item WI will see a total order of any contiguous sequence of
work-item-scope atomics Wa because the memory system maintains single-thread program order
and any attempt by another work-item to observe/modify one of the locations touched by
operations in Wa without larger-scope synchronization would form a heterogeneous race
(a dynamic work-item scope is not inclusive with any other scope visible to another
work-item). For any other work-item WI' to observe/modify a value in Wa, there must exist a
system-scope seq_cst store, R, after the last operation in Wa in program order and a seq_cst
acquire, A, that occurs after R in the total order of system atomics. Before R executes, all
prior operations by WI must be flushed to main memory. Those flushes may not occur in
program order, but because another work-item cannot yet observe them without forming
a race, it does not matter. The store associated with R must complete after all of the
updates in Wa are flushed. Similarly, the load A must complete after R. Because the next
sequence of work-item-scope atomics, Wb, performed by WI' must be isolated in the same way as
the operations in Wa, and because WI' will have invalidated all caches in c(System) before
completing A, we can construct an apparent total order consisting of Wa → R → A → Wb.
Since all work-item-scope atomics must be part of either a Wa or Wb sequence, and all
system-scope atomics are totally ordered, we have constructed a total order of all operations.

Now consider a program that mixes system-scope, work-group-scope, and work-item-scope
atomics. Similar to above, we need to show that (1) all work-items in the work-group will
observe a total order of work-group atomics that respects program order and that (2) we can
place any contiguous sequence, Ga, of work-group-scope atomics in the apparent total order
formed by the work-item-scope atomics and system-scope atomics. First, we will show that
all work-items in a work-group will observe a total order of all work-group-scope atomics.
When a work-item performs a work-group-scope load, it will flush/invalidate its own write
buffer first. Thus, all work-group-scope loads will go through the L1 cache. From here, we
can apply the same reasoning as above for system-scope total order and main memory to
arrive at the conclusion that all work-group-scope operations will form a total order w.r.t.
each other, taking into account the fact that no other work-item from a different work-group
can affect that order without forming a race. We can combine that work-group total
order with the constituent work-item total orders using the same technique used to combine
work-item and system orders above. Then we can combine the work-item/work-group total
order with system order by applying the method one more time, resulting in a total order
of all operations.

We can include device scope into the mix using the same approach.
A heterogeneous-race-free program that performs ordinary or relaxed operations in addition
to mo_sc operations will still maintain the total order of mo_sc operations. TODO. □



Lemma B.3. An execution on the example system produces an apparent hhb.dr order
that is consistent with coh and that does not contain a cycle.

Proof. We need to show that $\xrightarrow{hhb.dr}$ exists, that it is consistent with
$\xrightarrow{coh}$, and that it does not contain a cycle. We have already shown that
$\xrightarrow{coh}$ exists in Lemma B.1, and $\xrightarrow{so}$ is a subset of $\xrightarrow{coh}$,
so it must exist as well. Program order also exists, therefore $\xrightarrow{hhb.dr}$ must
exist. By construction, $\xrightarrow{hhb.dr}$ is consistent with $\xrightarrow{coh}$.

We still need to show that there is no cycle in $\xrightarrow{hhb.dr}$. There is no cycle in
program order, so the only way to get a cycle in $\xrightarrow{hhb.dr}$ is to have a chain
$A \xrightarrow{po} B \xrightarrow{so} C \xrightarrow{po} D \xrightarrow{so} A \xrightarrow{po} B$.
This chain is impossible, because it would require $B \xrightarrow{so} C \xrightarrow{so} B$, which
implies $B \xrightarrow{coh} C \xrightarrow{coh} B$, and we have already shown that
$\xrightarrow{coh}$ does not contain a cycle. □

Theorem B.4. The example implementation is a valid implementation of
HRF-direct-relaxed.

Proof. This follows from Lemma B.1, Lemma B.2, Lemma B.3, and the definition of
HRF-direct-relaxed. □

C. MODEL EQUIVALENCES 

The HRF-Relaxed and HRF-Indirect-Relaxed definitions are sufficiently flexible to describe
all of the HRF-Direct, HRF-Indirect, and the fine-grained SVM subset of the OpenCL memory
models. In the following subsections we define each of the existing models by placing
additional restrictions on either HRF-relaxed or HRF-indirect-relaxed. As the astute reader
will notice, because we are able to define each model through restrictions on
HRF(-Indirect)-Relaxed, we can claim that the set of allowed executions in
HRF(-Indirect)-Relaxed is a superset of the allowed executions in the three prior models,
meaning that any implementation supporting the previous models will also be correct for
HRF(-Indirect)-Relaxed.
C.1. HRF-direct

We can still describe HRF-direct in terms of HRF-relaxed by re-introducing the additional
restrictions that HRF-direct places. We can then show the models to be identical by proving
that all race-free HRF-direct programs must be sequentially consistent on any valid
HRF-relaxed implementation.

To define HRF-direct in terms of HRF-relaxed, we restrict the model in two
ways. First, HRF-direct does not support relaxed atomics, so all atomics must use
memory_order_seq_cst. Second, HRF-direct does not support generalized scope inclusion,
so we define $A \xrightarrow{incl} B$ to specifically be $\mathit{scopeof}(A) = \mathit{scopeof}(B)$.

We have already proven that a heterogeneous-race-free program will result in a sequentially
consistent execution on an HRF-direct-relaxed implementation, and thus the models are
equivalent.

C.2. HRF-indirect 

TODO 